Coding Event Data with GPT

The problem

Define your broad corpus

Find a set of keywords that appear in your corpus.

This could include:

Event terms, for example: “protest”, “vote”, “bombing”
Actor names or identifiers, for example: “Biden”, “LTTE”, “Prime Minister”
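
As a minimal sketch of how a keyword list like this might be put to work, the base R snippet below combines event terms and actor names into one case-insensitive pattern and flags the texts that match. The `docs` vector is invented for illustration only; your own corpus would take its place.

```r
# Keywords taken from the examples above
event_terms <- c("protest", "vote", "bombing")
actor_terms <- c("Biden", "LTTE", "Prime Minister")

# Combine all keywords into one regular expression that matches whole words
keyword_pattern <- paste0(
  "\\b(", paste(c(event_terms, actor_terms), collapse = "|"), ")\\b"
)

# Hypothetical documents, for illustration only
docs <- c(
  "The Prime Minister addressed parliament on Tuesday.",
  "Local bakery wins regional award.",
  "Thousands joined a protest in the capital."
)

# TRUE where a document mentions at least one keyword
has_keyword <- grepl(keyword_pattern, docs, ignore.case = TRUE, perl = TRUE)
docs[has_keyword]
```

Matching on whole words keeps a short term like "vote" from matching inside an unrelated longer word, at the cost of missing inflected forms such as "voted".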

Collect your broad corpus

Collect all articles or text sources that include your keywords.

Sources could include:

APIs
Web scraping

My broad corpus

library(tidyverse)

all_articles <- read_csv(here::here("reports", "data", "all_articles.csv"))
all_articles

# A tibble: 49,343 × 7
   document_document_id  source_name title publication_date    text  probability
   <chr>                 <chr>       <chr> <dttm>              <chr>       <dbl>
 1 7V2B-8WR1-2PBV-B4HN-… The Associ… Sovi… 1991-04-02 19:00:00 "Whe…       0.517
 2 3TDD-S850-0031-V23G-… Agence Fra… Firs… 1991-09-08 20:00:00 "The…       0.883
 3 3SJ4-DY20-0007-W197-… RusData Di… BALT… 1991-09-17 20:00:00 "WAS…       0.929
 4 49R6-55B0-01VR-924M-… CNN.com     Afgh… 1991-11-10 19:00:00 "A d…       0.966
 5 3SJ4-D9W0-0008-C28H-… The Associ… Afgh… 1991-11-14 19:00:00 "The…       0.904
 6 4D07-C0T0-009F-R125-… The Associ… paki… 1991-11-15 19:00:00 "pak…       0.708
 7 3TDD-S2B0-0031-V1N2-… Agence Fra… Afgh… 1991-11-17 19:00:00 "Afg…       0.664
 8 549W-G371-JBTF-631F-… The Nation  Kabu… 1991-12-20 19:00:00 "The…       0.691
 9 41FB-23C0-0041-709H-… Official K… Kabu… 1991-12-20 19:00:00 "The…       0.690
10 3TDD-RYF0-0031-V1N5-… Agence Fra… Mosc… 1991-12-21 19:00:00 "In …       0.674
# ℹ 49,333 more rows
# ℹ 1 more variable: dispute <chr>

Whittle this broad corpus down

You now need to work out which of these articles actually include information on your events and actors.

We are going to ask GPT to tell us the following:

Identify with ‘yes’ or ‘no’ whether the following article mentions a meeting or event involving representatives of at least two countries, or of at least one country and organization, in which they discuss a conflict involving at least one of those countries.

Building your prompt

Your prompt needs to be:

Specific,
Concise,
Simple.

Working with a single article

Let’s head over to ChatGPT and see this classification task in action:

How did it go?

Working with many articles

This is great, but what if you have many, many articles that you need to code?

Let’s use R to help us out!

For each article, we want to:

Build a prompt,
Run a GPT model of our choice against that prompt,
Record its response in a data frame.

Building your prompt

First, we need to start with our base prompt:

base_prompt <- "Identify with 'yes' or 'no' whether the following article (delimited in XML tags) mentions a meeting or event involving representatives of at least two countries, or of at least one country and organization, in which they discuss a conflict involving at least one of those countries: <article>{article_body}</article>"
base_prompt

[1] "Identify with 'yes' or 'no' whether the following article (delimited in XML tags) mentions a meeting or event involving representatives of at least two countries, or of at least one country and organization, in which they discuss a conflict involving at least one of those countries: <article>{article_body}</article>"
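
The `{article_body}` placeholder marks where each article's text will go. As a rough base R sketch of the same templating idea, the hypothetical helper `build_prompt()` below wraps any number of article texts in the XML tags at once. The instruction is shortened here for readability, and the two article texts are placeholders.

```r
# Shortened stand-in for the full instruction (illustration only)
instruction <- "Identify with 'yes' or 'no' whether the following article (delimited in XML tags) mentions a meeting:"

# paste0() is vectorised, so one call builds a prompt for every article
build_prompt <- function(article_body, instruction) {
  paste0(instruction, " <article>", article_body, "</article>")
}

# Placeholder article texts
bodies <- c("First article text.", "Second article text.")

prompts <- build_prompt(bodies, instruction)
prompts
```

Passing the article text as an explicit argument makes it obvious which text ends up in which prompt.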

Building your prompt

Next, we need to include our article text:

article_body <- all_articles |> 
  # Select the first article
  slice(1) |> 
  # Pull out the text
  pull(text)
article_body

[1] "When Yelena Khanga first visited the United States five years ago, she told people she was Russian.\n\n \n\n    \"Everyone was very polite,\" Khanga told a small gathering of students and teachers Tuesday at the University of South Florida (USF). \"They treated me like a guest.\"\n\n   But for her next visit, Khanga, who is black, decided to keep quiet.\n\n \n\n    \"Now . . . I just try to pass like a black American,\" said Khanga, 25.  She said: \"I think I know what it is to be black American in United States.\"\n\n \n\n    Khanga, a journalist from the Russianrepublic of the Soviet Union, lives in New York City. She spoke Tuesday at the University of Tampa about her life. In addition, Khanga spoke at USF and filmed a segment of The Bridge, a WUSF-TV talk show.\n\n \n\n    She spoke with zeal about many aspects of Russian glasnost, feminism and the KGB among them.\n\n \n\n    She also spoke about race. In the United States, she said, she had the feeling of being treated differently because of her race. \"I won't get into specifics,\" she said.\n\n \n\n    On the other hand, in her native land, Khanga said, she thinks few race relations problems surface because, unlike the United States, Russia does not have a history of African slavery.\n\n \n\n    Although she talked about serious tensions between different Z ethnic groups, Khanga said the country, particularly Moscow, has many interracial families. She said a number of Russians are of African descent.\n\n \n\n    Some at the lecture said they find it hard to conceive of such a racially integrated society.\n\n \n\n    \"I'm not so sure,\" said Troy Collier, an associate dean of students at USF. \"I suspect there is a race problem in Russia.\" Collier said he thinks color makes a difference, especially to the people who control a country.\n\n \n\n    Khanga is the great-grandaughter of an ex-slave who lived in Mississippi. 
Her grandfather attended Tuskegee Institute in Alabama and moved to New York where he met and married Khanga's grandmother, a Jewish woman.\n\n \n\n    The couple was among a group of 16 black families who left the United States in the 1930s to escape the Depression and find what they hoped would be fairer treatment under communism.\n\n \n\n    The families settled in Uzbekistan,near the border with Afghanistan, primarily because the Moslems in the area were people of color. Some of the black families remained in the area and some eventually returned to the United States.\n\n \n\n    Khanga's mother and father, who was born in Tanzania, lived in Uzbekistan. Khanga attended Moscow State University. Since graduating, she has worked six years for the Moscow News, a major Soviet Union newspaper, and she has visited the United States eight times.\n\n \n\n    Two months ago, after extensive research, Khanga visited the site of her great-grandfather's home in Mississippi.\n\n \n\n    \"It was so exciting to know that she was . . . Russian,\" said Wanda Lewis Campbell, an associate dean of students at USF, \"but at the same time she looks like one of my cousins.\""

Building your prompt

Finally, we need to add this article body into our prompt:

article_prompt <- glue::glue(base_prompt)
article_prompt

Identify with 'yes' or 'no' whether the following article (delimited in XML tags) mentions a meeting or event involving representatives of at least two countries, or of at least one country and organization, in which they discuss a conflict involving at least one of those countries: <article>When Yelena Khanga first visited the United States five years ago, she told people she was Russian.

 

    "Everyone was very polite," Khanga told a small gathering of students and teachers Tuesday at the University of South Florida (USF). "They treated me like a guest."

   But for her next visit, Khanga, who is black, decided to keep quiet.

 

    "Now . . . I just try to pass like a black American," said Khanga, 25.  She said: "I think I know what it is to be black American in United States."

 

    Khanga, a journalist from the Russianrepublic of the Soviet Union, lives in New York City. She spoke Tuesday at the University of Tampa about her life. In addition, Khanga spoke at USF and filmed a segment of The Bridge, a WUSF-TV talk show.

 

    She spoke with zeal about many aspects of Russian glasnost, feminism and the KGB among them.

 

    She also spoke about race. In the United States, she said, she had the feeling of being treated differently because of her race. "I won't get into specifics," she said.

 

    On the other hand, in her native land, Khanga said, she thinks few race relations problems surface because, unlike the United States, Russia does not have a history of African slavery.

 

    Although she talked about serious tensions between different Z ethnic groups, Khanga said the country, particularly Moscow, has many interracial families. She said a number of Russians are of African descent.

 

    Some at the lecture said they find it hard to conceive of such a racially integrated society.

 

    "I'm not so sure," said Troy Collier, an associate dean of students at USF. "I suspect there is a race problem in Russia." Collier said he thinks color makes a difference, especially to the people who control a country.

 

    Khanga is the great-grandaughter of an ex-slave who lived in Mississippi. Her grandfather attended Tuskegee Institute in Alabama and moved to New York where he met and married Khanga's grandmother, a Jewish woman.

 

    The couple was among a group of 16 black families who left the United States in the 1930s to escape the Depression and find what they hoped would be fairer treatment under communism.

 

    The families settled in Uzbekistan,near the border with Afghanistan, primarily because the Moslems in the area were people of color. Some of the black families remained in the area and some eventually returned to the United States.

 

    Khanga's mother and father, who was born in Tanzania, lived in Uzbekistan. Khanga attended Moscow State University. Since graduating, she has worked six years for the Moscow News, a major Soviet Union newspaper, and she has visited the United States eight times.

 

    Two months ago, after extensive research, Khanga visited the site of her great-grandfather's home in Mississippi.

 

    "It was so exciting to know that she was . . . Russian," said Wanda Lewis Campbell, an associate dean of students at USF, "but at the same time she looks like one of my cousins."</article>

Run a GPT model against this prompt

The end goal:

library(httr2)

req <- request("https://api.openai.com/v1/chat/completions") |>
  req_headers("Content-Type" = "application/json",
              "Authorization" = paste("Bearer", Sys.getenv("OPENAI_API_KEY_NSF"))) |>
  req_body_json(
    list(
      "model" = "gpt-3.5-turbo-0613",
      "messages" = list(
        list(
          "role" = "system",
          "content" = "You are a helpful assistant."
        ),
        list(
          "role" = "user",
          "content" = prompt
        )
      ),
      "temperature" = 0,
      "max_tokens" = 1
    )
  ) |>
  req_retry(max_tries = 3)

Building your API request

We are going to take advantage of the fantastic httr2 R package to work with the OpenAI API.

First, we need to build our request to the API. You need:

The API endpoint,
The content type with which you would like to work,
Your authorization to use the API,
Your chosen GPT model,
Your model parameters.

The API endpoint

The endpoint depends on the type of GPT model you want to use.

For classification tasks, we have two options:

Chat completion models,
Completion models.

The API endpoint

The base URL for chat completion models is:

https://api.openai.com/v1/chat/completions

The base URL for completion models is:

https://api.openai.com/v1/completions

The API endpoint

req <- request("https://api.openai.com/v1/chat/completions") |>
  req_headers("Content-Type" = "application/json",
              "Authorization" = paste("Bearer", Sys.getenv("OPENAI_API_KEY_NSF"))) |>
  req_body_json(
    list(
      "model" = "gpt-3.5-turbo-0613",
      "messages" = list(
        list(
          "role" = "system",
          "content" = "You are a helpful assistant."
        ),
        list(
          "role" = "user",
          "content" = prompt
        )
      ),
      "temperature" = 0,
      "max_tokens" = 1
    )
  ) |>
  req_retry(max_tries = 3)

httr2::request() takes that URL as its first argument.

Content type

Most modern APIs send and receive data as JSON, a lightweight text format for sharing data.

Source: Stack Overflow
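
To make this concrete, here is a small sketch of how the nested R list in our request looks once it is written out as JSON. It uses the jsonlite package directly, which is an assumption made for illustration: in the request itself, req_body_json() takes care of this conversion for us. The user prompt is a placeholder.

```r
library(jsonlite)

# The same structure we hand to req_body_json(), with a placeholder prompt
body <- list(
  model = "gpt-3.5-turbo",
  messages = list(
    list(role = "system", content = "You are a helpful assistant."),
    list(role = "user", content = "Placeholder prompt text")
  ),
  temperature = 0,
  max_tokens = 1
)

# auto_unbox = TRUE writes length-one vectors as single values, not arrays
body_json <- toJSON(body, auto_unbox = TRUE, pretty = TRUE)
body_json
```

Named R lists become JSON objects and unnamed lists become JSON arrays, which is why `messages` is an unnamed list of named lists.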

Content type

req <- request("https://api.openai.com/v1/chat/completions") |>
  req_headers("Content-Type" = "application/json",
              "Authorization" = paste("Bearer", Sys.getenv("OPENAI_API_KEY_NSF"))) |>
  req_body_json(
    list(
      "model" = "gpt-3.5-turbo-0613",
      "messages" = list(
        list(
          "role" = "system",
          "content" = "You are a helpful assistant."
        ),
        list(
          "role" = "user",
          "content" = prompt
        )
      ),
      "temperature" = 0,
      "max_tokens" = 1
    )
  ) |>
  req_retry(max_tries = 3)

Authorization

Using GPT through the API is not free.

You will need:

An OpenAI account with billing set up (API usage is billed separately from a ChatGPT subscription),
An API key.

Authorization

You should never hard code an API key into an R script.

Instead, store it as an environment variable, either in your .Renviron file or by running this in the console (not in a saved script):

Sys.setenv("OPENAI_API_DEMO" = "XXXXXXXXXXXXXXXXXXXXXX")

Now you can use the API key without writing it out directly:

Sys.getenv("OPENAI_API_DEMO")

[1] "XXXXXXXXXXXXXXXXXXXXXX"
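
Sys.getenv() quietly returns an empty string when a variable is missing, which would only surface later as a failed request. A small sketch of a guard, where `get_api_key()` is a hypothetical helper name and the key is a placeholder:

```r
# Fetch a key from the environment, failing loudly if it is not set
get_api_key <- function(var_name) {
  key <- Sys.getenv(var_name, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable '", var_name, "' is not set.", call. = FALSE)
  }
  key
}

# Demo with a placeholder value (never a real key)
Sys.setenv(OPENAI_API_DEMO = "XXXXXXXXXXXXXXXXXXXXXX")

# Build the value for the Authorization header
auth_header <- paste("Bearer", get_api_key("OPENAI_API_DEMO"))
auth_header
```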

Authorization

req <- request("https://api.openai.com/v1/chat/completions") |>
  req_headers("Content-Type" = "application/json",
              "Authorization" = paste("Bearer", 
                                      Sys.getenv("OPENAI_API_PERSONAL"))) |>
  req_body_json(
    list(
      "model" = "gpt-3.5-turbo-0613",
      "messages" = list(
        list(
          "role" = "system",
          "content" = "You are a helpful assistant."
        ),
        list(
          "role" = "user",
          "content" = prompt
        )
      ),
      "temperature" = 0,
      "max_tokens" = 1
    )
  ) |>
  req_retry(max_tries = 3)

Selecting your GPT model

There are many different families of GPT models:

Source: OpenAI API documentation

Selecting your GPT model

We are working with chat completion models:

gpt-4
gpt-4 turbo
gpt-3.5-turbo

Selecting your GPT model

req <- request("https://api.openai.com/v1/chat/completions") |>
  req_headers("Content-Type" = "application/json",
              "Authorization" = paste("Bearer", 
                                      Sys.getenv("OPENAI_API_PERSONAL"))) |>
  req_body_json(
    list(
      "model" = "gpt-3.5-turbo",
      "messages" = list(
        list(
          "role" = "system",
          "content" = "You are a helpful assistant."
        ),
        list(
          "role" = "user",
          "content" = prompt
        )
      ),
      "temperature" = 0,
      "max_tokens" = 1
    )
  ) |>
  req_retry(max_tries = 3)

Specifying the role we want GPT to play

GPT can play many different roles. You can be very creative here and specify exactly what role you would like GPT to play.

For example, you can ask GPT to respond like:

An academic colleague,
A reviewer,
Shakespeare.

Specifying the role we want GPT to play

req <- request("https://api.openai.com/v1/chat/completions") |>
  req_headers("Content-Type" = "application/json",
              "Authorization" = paste("Bearer", 
                                      Sys.getenv("OPENAI_API_PERSONAL"))) |>
  req_body_json(
    list(
      "model" = "gpt-3.5-turbo",
      "messages" = list(
        list(
          "role" = "system",
          "content" = "You are a helpful assistant."
        ),
        list(
          "role" = "user",
          "content" = prompt
        )
      ),
      "temperature" = 0,
      "max_tokens" = 1
    )
  ) |>
  req_retry(max_tries = 3)

Include your prompt

req <- request("https://api.openai.com/v1/chat/completions") |>
  req_headers("Content-Type" = "application/json",
              "Authorization" = paste("Bearer", 
                                      Sys.getenv("OPENAI_API_PERSONAL"))) |>
  req_body_json(
    list(
      "model" = "gpt-3.5-turbo",
      "messages" = list(
        list(
          "role" = "system",
          "content" = "You are a helpful assistant."
        ),
        list(
          "role" = "user",
          "content" = prompt
        )
      ),
      "temperature" = 0,
      "max_tokens" = 1
    )
  ) |>
  req_retry(max_tries = 3)

Specify your model parameters: `temperature`

The temperature parameter controls how random the model output will be.

It takes a value between 0 and 2: higher values make the output more random, and 0 makes it as close to deterministic as the model allows.

For classification tasks, we want a straightforward “Yes” or “No” answer. In other words, we want as little randomness as possible.
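
To build some intuition, the sketch below applies the standard temperature-scaled softmax to a set of made-up scores for three candidate tokens. This illustrates the general idea only: the scores are hypothetical, and OpenAI does not publish its sampling code, so treat this as an assumption about the mechanism rather than a description of the API's internals.

```r
# Turn raw scores (logits) into probabilities at a given temperature
softmax_temperature <- function(logits, temperature) {
  stopifnot(temperature > 0)
  scaled <- logits / temperature
  # Subtract the maximum before exponentiating, for numerical stability
  weights <- exp(scaled - max(scaled))
  weights / sum(weights)
}

# Hypothetical scores for three candidate next tokens
logits <- c(Yes = 2.0, No = 1.0, Maybe = 0.5)

# High temperature: probability is spread across the candidates
p_high <- softmax_temperature(logits, temperature = 2)

# Low temperature: almost all probability sits on the top candidate
p_low <- softmax_temperature(logits, temperature = 0.1)

round(p_high, 3)
round(p_low, 3)
```

The formula divides by the temperature, so it is undefined at exactly 0. A temperature of 0 is usually understood as the limiting case of always picking the highest-scoring token.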

Specify your model parameters: `temperature`

req <- request("https://api.openai.com/v1/chat/completions") |>
  req_headers("Content-Type" = "application/json",
              "Authorization" = paste("Bearer", 
                                      Sys.getenv("OPENAI_API_PERSONAL"))) |>
  req_body_json(
    list(
      "model" = "gpt-3.5-turbo",
      "messages" = list(
        list(
          "role" = "system",
          "content" = "You are a helpful assistant."
        ),
        list(
          "role" = "user",
          "content" = prompt
        )
      ),
      "temperature" = 0,
      "max_tokens" = 1
    )
  ) |>
  req_retry(max_tries = 3)

Specify your model parameters: `max_tokens`

A token represents a group of characters (sometimes whole words) that is meaningful to the GPT model.

Source: OpenAI API documentation

Specify your model parameters: `max_tokens`

The max_tokens parameter sets the maximum number of tokens the model can generate in its response.

We want “Yes” or “No”. Each is represented by a single token.
You can check how many tokens a piece of text uses with the OpenAI Tokenizer.

Specify your model parameters: `max_tokens`

req <- request("https://api.openai.com/v1/chat/completions") |>
  req_headers("Content-Type" = "application/json",
              "Authorization" = paste("Bearer", 
                                      Sys.getenv("OPENAI_API_PERSONAL"))) |>
  req_body_json(
    list(
      "model" = "gpt-3.5-turbo",
      "messages" = list(
        list(
          "role" = "system",
          "content" = "You are a helpful assistant."
        ),
        list(
          "role" = "user",
          "content" = prompt
        )
      ),
      "temperature" = 0,
      "max_tokens" = 1
    )
  ) |>
  req_retry(max_tries = 3)

Specify your model parameters

There are many different parameters you can control when using chat completion models.

Check out the full list in the OpenAI documentation.

Make your requests more robust

Sometimes, your request will fail. You can ask R to retry your request using httr2::req_retry().

req <- request("https://api.openai.com/v1/chat/completions") |>
  req_headers("Content-Type" = "application/json",
              "Authorization" = paste("Bearer", 
                                      Sys.getenv("OPENAI_API_PERSONAL"))) |>
  req_body_json(
    list(
      "model" = "gpt-3.5-turbo",
      "messages" = list(
        list(
          "role" = "system",
          "content" = "You are a helpful assistant."
        ),
        list(
          "role" = "user",
          "content" = prompt
        )
      ),
      "temperature" = 0,
      "max_tokens" = 1
    )
  ) |>
  req_retry(max_tries = 3)
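
To show what retrying amounts to, here is a generic base R sketch. It is not httr2's implementation: `with_retry()` and `flaky_call()` are hypothetical names, and the flaky function simply stands in for an API call that fails twice before succeeding.

```r
# Call f() up to max_tries times, waiting a little longer after each failure
with_retry <- function(f, max_tries = 3, base_wait = 0.01) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt < max_tries) {
      # Exponential backoff: base_wait, then 2 * base_wait, and so on
      Sys.sleep(base_wait * 2^(attempt - 1))
    }
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(result),
       call. = FALSE)
}

# A stand-in for a flaky API call: fails twice, then succeeds
calls <- 0
flaky_call <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "No"
}

result <- with_retry(flaky_call)
result
```

Waiting longer between attempts gives a busy server time to recover. As far as I know, req_retry() only retries errors it considers temporary, such as rate-limit responses, so a request that is simply malformed will still fail straight away.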

Run the GPT model for one article

We are ready to make that API request!

We have our prompt:

article_prompt


Run the GPT model for one article

We can insert this prompt into our fully fleshed-out request:

req <- request("https://api.openai.com/v1/chat/completions") |>
  req_headers("Content-Type" = "application/json",
              "Authorization" = paste("Bearer", 
                                      Sys.getenv("OPENAI_API_PERSONAL"))) |>
  req_body_json(
    list(
      "model" = "gpt-3.5-turbo",
      "messages" = list(
        list(
          "role" = "system",
          "content" = "You are a helpful assistant."
        ),
        list(
          "role" = "user",
          "content" = article_prompt
        )
      ),
      "temperature" = 0,
      "max_tokens" = 1
    )
  ) |>
  req_retry(max_tries = 3)

Run the GPT model for one article

Now we just need to perform that request:

resp <- req_perform(req)

And see the response:

resp

$method
[1] "POST"

$url
[1] "https://api.openai.com/v1/chat/completions"

$status_code
[1] 200

$headers
$date
[1] "Sat, 18 Nov 2023 18:26:01 GMT"

$`content-type`
[1] "application/json"

$`access-control-allow-origin`
[1] "*"

$`cache-control`
[1] "no-cache, must-revalidate"

$`openai-model`
[1] "gpt-3.5-turbo-0613"

$`openai-organization`
[1] "user-plgttstfzcy8uauk33qyaogz"

$`openai-processing-ms`
[1] "630"

$`openai-version`
[1] "2020-10-01"

$`strict-transport-security`
[1] "max-age=15724800; includeSubDomains"

$`x-ratelimit-limit-requests`
[1] "3500"

$`x-ratelimit-limit-tokens`
[1] "90000"

$`x-ratelimit-limit-tokens_usage_based`
[1] "90000"

$`x-ratelimit-remaining-requests`
[1] "3499"

$`x-ratelimit-remaining-tokens`
[1] "89146"

$`x-ratelimit-remaining-tokens_usage_based`
[1] "89146"

$`x-ratelimit-reset-requests`
[1] "17ms"

$`x-ratelimit-reset-tokens`
[1] "568ms"

$`x-ratelimit-reset-tokens_usage_based`
[1] "568ms"

$`x-request-id`
[1] "418dd991b2eb21aa03d15269d45790e9"

$`cf-cache-status`
[1] "DYNAMIC"

$`set-cookie`
[1] "__cf_bm=oQeTwtEc9Ed80gvIWthQ_2X_9U4u47we2sUnjMSWKZw-1700331961-0-AXUpeq0lT3lpre+GDYvMLPz6388fNhJsDbhz/m1RWlCobQISAdhVIr2ayAJzst1fBl2vDb3TNMaM509KMVczpjc=; path=/; expires=Sat, 18-Nov-23 18:56:01 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None"

$`set-cookie`
[1] "_cfuvid=tlRgQ9nTsEuv0TijI2vENdniVKFlUkQ2vUx7_zT0C64-1700331961712-0-604800000; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None"

$server
[1] "cloudflare"

$`cf-ray`
[1] "82824263b9a07fe1-IAD"

$`content-encoding`
[1] "gzip"

$`alt-svc`
[1] "h3=\":443\"; ma=86400"

attr(,"class")
[1] "httr2_headers"

$body
  [1] 7b 0a 20 20 22 69 64 22 3a 20 22 63 68 61 74 63 6d 70 6c 2d 38 4d 4b 49 7a
 [26] 54 71 68 59 68 51 4e 74 75 64 41 70 75 61 37 64 4d 51 4d 4d 70 44 30 67 22
 [51] 2c 0a 20 20 22 6f 62 6a 65 63 74 22 3a 20 22 63 68 61 74 2e 63 6f 6d 70 6c
 [76] 65 74 69 6f 6e 22 2c 0a 20 20 22 63 72 65 61 74 65 64 22 3a 20 31 37 30 30
[101] 33 33 31 39 36 31 2c 0a 20 20 22 6d 6f 64 65 6c 22 3a 20 22 67 70 74 2d 33
[126] 2e 35 2d 74 75 72 62 6f 2d 30 36 31 33 22 2c 0a 20 20 22 63 68 6f 69 63 65
[151] 73 22 3a 20 5b 0a 20 20 20 20 7b 0a 20 20 20 20 20 20 22 69 6e 64 65 78 22
[176] 3a 20 30 2c 0a 20 20 20 20 20 20 22 6d 65 73 73 61 67 65 22 3a 20 7b 0a 20
[201] 20 20 20 20 20 20 20 22 72 6f 6c 65 22 3a 20 22 61 73 73 69 73 74 61 6e 74
[226] 22 2c 0a 20 20 20 20 20 20 20 20 22 63 6f 6e 74 65 6e 74 22 3a 20 22 4e 6f
[251] 22 0a 20 20 20 20 20 20 7d 2c 0a 20 20 20 20 20 20 22 66 69 6e 69 73 68 5f
[276] 72 65 61 73 6f 6e 22 3a 20 22 6c 65 6e 67 74 68 22 0a 20 20 20 20 7d 0a 20
[301] 20 5d 2c 0a 20 20 22 75 73 61 67 65 22 3a 20 7b 0a 20 20 20 20 22 70 72 6f
[326] 6d 70 74 5f 74 6f 6b 65 6e 73 22 3a 20 37 38 31 2c 0a 20 20 20 20 22 63 6f
[351] 6d 70 6c 65 74 69 6f 6e 5f 74 6f 6b 65 6e 73 22 3a 20 31 2c 0a 20 20 20 20
[376] 22 74 6f 74 61 6c 5f 74 6f 6b 65 6e 73 22 3a 20 37 38 32 0a 20 20 7d 0a 7d
[401] 0a

attr(,"class")
[1] "httr2_response"
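
The body looks cryptic because it is a raw vector: each two-digit value is one byte written in hexadecimal. A quick sketch, using only the first eight bytes printed above, shows that those bytes are ordinary JSON text:

```r
# First eight bytes of the body shown above
first_bytes <- as.raw(c(0x7b, 0x0a, 0x20, 0x20, 0x22, 0x69, 0x64, 0x22))

# Convert the raw bytes to a character string
decoded <- rawToChar(first_bytes)

# cat() prints the text as-is: an opening brace, a line break, then "id"
cat(decoded)
```

In practice you would not decode the bytes by hand. httr2 provides resp_body_string() and resp_body_json() to do this for the whole response.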

GPT’s response

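Let’s break down this response. The status code (200) tells us the request succeeded, the headers report which model answered and our remaining rate limits, and the body holds the JSON reply as raw bytes.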


But where is the prediction?

Welcome to the JSON rabbit hole…

resp_body_json(resp)

$id
[1] "chatcmpl-8MKIzTqhYhQNtudApua7dMQMMpD0g"

$object
[1] "chat.completion"

$created
[1] 1700331961

$model
[1] "gpt-3.5-turbo-0613"

$choices
$choices[[1]]
$choices[[1]]$index
[1] 0

$choices[[1]]$message
$choices[[1]]$message$role
[1] "assistant"

$choices[[1]]$message$content
[1] "No"


$choices[[1]]$finish_reason
[1] "length"



$usage
$usage$prompt_tokens
[1] 781

$usage$completion_tokens
[1] 1

$usage$total_tokens
[1] 782

But where is the prediction?

Welcome to the JSON rabbit hole…

resp_body_json(resp)$choices

[[1]]
[[1]]$index
[1] 0

[[1]]$message
[[1]]$message$role
[1] "assistant"

[[1]]$message$content
[1] "No"


[[1]]$finish_reason
[1] "length"

But where is the prediction?

Welcome to the JSON rabbit hole…

resp_body_json(resp)$choices[[1]]$message

$role
[1] "assistant"

$content
[1] "No"

But where is the prediction?

Welcome to the JSON rabbit hole…

resp_body_json(resp)$choices[[1]]$message$content

[1] "No"
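
Chaining `$` and `[[1]]` works when the response has the expected shape, but it throws an error if `choices` comes back empty. Below is a sketch of a slightly more defensive extractor. `extract_prediction()` is a hypothetical helper, and `parsed` is a hand-built stand-in for resp_body_json(resp), trimmed to the fields used here.

```r
# Stand-in for resp_body_json(resp), trimmed to the relevant fields
parsed <- list(
  choices = list(
    list(
      index = 0,
      message = list(role = "assistant", content = "No"),
      finish_reason = "length"
    )
  )
)

# Return the model's answer, or NA if the expected fields are missing
extract_prediction <- function(parsed) {
  choices <- parsed$choices
  if (is.null(choices) || length(choices) == 0) {
    return(NA_character_)
  }
  content <- choices[[1]]$message$content
  if (is.null(content)) NA_character_ else content
}

extract_prediction(parsed)                # "No"
extract_prediction(list(error = "oops"))  # NA
```

The finish_reason of "length" is expected here: we capped the output at one token, so the model stopped because it ran out of room.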

Save the prediction

pred <- resp_body_json(resp)$choices[[1]]$message$content

df <- tibble(
    body = article_body,
    gpt_pred = pred
  )

df

# A tibble: 1 × 2
  body                                                                  gpt_pred
  <chr>                                                                 <chr>   
1 "When Yelena Khanga first visited the United States five years ago, … No

Congratulations!

You have now used an advanced large language model to identify whether an event is referenced in a news article.

Let’s set you up to do that across all 49,343 articles.

Building your article reader function

End goal:

article_classification <- function(article_body) {
  
  # Create your prompt
  article_prompt <- glue::glue("Identify with 'yes' or 'no' whether the following article (delimited in XML tags) mentions a meeting or event involving representatives of at least two countries, or of at least one country and organization, in which they discuss a conflict involving at least one of those countries: <article>{article_body}</article>")
  
  # Build your request
  req <- request("https://api.openai.com/v1/chat/completions") |>
    req_headers("Content-Type" = "application/json",
                "Authorization" = paste("Bearer", 
                                        Sys.getenv("OPENAI_API_PERSONAL"))) |>
    req_body_json(
      list(
        "model" = "gpt-3.5-turbo",
        "messages" = list(
          list(
            "role" = "system",
            "content" = "You are a helpful assistant."
          ),
          list(
            "role" = "user",
            "content" = article_prompt
          )
        ),
        "temperature" = 0,
        "max_tokens" = 1
      )
    ) |>
    req_retry(max_tries = 3)
  
  # Perform your request
  resp <- req_perform(req)
  
  # Clean up the response
  pred <- resp_body_json(resp)$choices[[1]]$message$content
  
  # Save the response
  df <- tibble(
    body = article_body,
    gpt_pred = pred
  )
  
  return(df)
  
}

Running our function across our articles

labelled_articles <- map(
  1:5, 
  ~ all_articles |>
      slice(.x) |>
      pull(text) |>
      article_classification()
) |> 
  bind_rows()

The result

labelled_articles

# A tibble: 5 × 2
  body                                                                  gpt_pred
  <chr>                                                                 <chr>   
1 "When Yelena Khanga first visited the United States five years ago, … No      
2 "The first Soviet delegation to visit Afghanistan since the failed h… Yes     
3 "WASHINGTON - President George Bush met Tuesday with the presidents … Yes     
4 "A delegation of Afghan mujahedeen rebels met Russian Vice-President… Yes     
5 "The chief guerrilla delegate at talks on ending Afghanistan's civil… Yes
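
Across thousands of articles, a single failed request would stop the whole map() call. One way to guard against this is purrr::possibly(). The sketch below uses fake_classification(), a made-up stand-in for article_classification(), so it runs without an API key. The 20-character cutoff is arbitrary and only there to trigger an error.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for article_classification(); it errors on long input
fake_classification <- function(article_body) {
  if (nchar(article_body) > 20) stop("Request failed.")
  tibble(body = article_body, gpt_pred = "No")
}

# possibly() returns `otherwise` instead of raising the error
safe_classification <- possibly(fake_classification, otherwise = NULL)

bodies <- c(
  "short article",
  "a much longer article that triggers the error",
  "another one"
)

# NULL results are dropped when the rows are bound together
labelled <- map(bodies, safe_classification) |>
  bind_rows()
```

Articles that failed are simply missing from the result, so compare the output against your input to see which ones need another attempt.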

Some tips: token limits

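There are (fairly generous) limits on the length of your prompts. These are measured in tokens, not characters.

For gpt-3.5-turbo, the limit is 4,096 tokens, shared between your prompt and the model's response.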

A token is roughly four characters (in English).
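
As a rough sketch of that rule of thumb, the helper below (estimate_tokens() is a made-up name) converts a character count into an approximate token count. It is a heuristic, not the tokenizer the API actually uses.

```r
# Rough estimate: about four characters per token in English text
estimate_tokens <- function(text) {
  ceiling(nchar(text) / 4)
}

token_limit <- 4096  # the gpt-3.5-turbo limit quoted above

example_prompt <- strrep("word ", 1000)  # 5,000 characters
estimate_tokens(example_prompt)          # 1250
estimate_tokens(example_prompt) <= token_limit
```

Because the estimate is approximate, it is safer to leave some headroom below the limit than to run right up against it.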

To make sure your prompts do not go over this limit, add a length check before building the request:

article_classification <- function(article_body) {
  
  # Create your prompt
  article_prompt <- glue::glue("Identify with 'yes' or 'no' whether the following article (delimited in XML tags) mentions a meeting or event involving representatives of at least two countries, or of at least one country and organization, in which they discuss a conflict involving at least one of those countries: <article>{article_body}</article>")
  
  # Rough check (about four characters per token): only proceed if the prompt fits
  if (nchar(article_prompt) / 4 <= 4096) {
    
    # Build your request
    req <- request("https://api.openai.com/v1/chat/completions") |>
    req_headers("Content-Type" = "application/json",
                "Authorization" = paste("Bearer", 
                                        Sys.getenv("OPENAI_API_PERSONAL"))) |>
    req_body_json(
      list(
        "model" = "gpt-3.5-turbo",
        "messages" = list(
          list(
            "role" = "system",
            "content" = "You are a helpful assistant."
          ),
          list(
            "role" = "user",
            "content" = article_prompt
          )
        ),
        "temperature" = 0,
        "max_tokens" = 1
      )
    ) |>
    req_retry(max_tries = 3)
  
    # Perform your request
    resp <- req_perform(req)
  
    # Clean up the response
    pred <- resp_body_json(resp)$choices[[1]]$message$content
  
    # Save the response
    df <- tibble(
      body = article_body,
      gpt_pred = pred
    )
  
    return(df)
    
  } else {
    
    stop("Prompt exceeds token limit.")
    
  }
  
}

Some tips: clean your input

You will often have junk in your text (for example, paragraph delimiters or news agency bylines). Removing this:

Increases the likelihood that you won’t reach the token limit,
Reduces your usage costs,
Can produce more accurate results.
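
A minimal cleaning sketch using stringr. The helper name clean_article() and the patterns are assumptions for illustration, and the raw string is invented. Adapt the patterns to the junk that actually appears in your sources.

```r
library(stringr)

# Hypothetical cleaning helper; the patterns are examples only
clean_article <- function(text) {
  text |>
    # Drop a leading agency tag such as "(AP) - "
    str_remove("^\\s*\\([A-Za-z ]+\\)\\s*-\\s*") |>
    # Replace HTML-style tags such as <p> with a space
    str_replace_all("<[^>]+>", " ") |>
    # Collapse repeated whitespace and newlines, trim the ends
    str_squish()
}

raw <- "(AP) - <p>President George Bush met   Tuesday\n\nwith the presidents.</p>"
clean_article(raw)
```

Run the cleaner on a handful of articles and read the output before applying it to the full corpus, since an overly greedy pattern can delete real content.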

Some tips: evaluating your model

You should check whether or not the model is correctly identifying relevant articles.

Select some labelled articles at random and hand code them.
Evaluate how accurately your model is performing.
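
One way to sketch that check is below. The data are toy values invented for illustration: gpt_pred mimics the model's output and hand_code is what a human coder might enter. In practice you would draw the rows to hand code at random, for example with slice_sample().

```r
library(dplyr)

# Toy data for illustration only, not real results
checked <- tibble(
  gpt_pred  = c("No", "Yes", "Yes", "Yes", "Yes", "No"),
  hand_code = c("No", "Yes", "Yes", "No",  "Yes", "Yes")
)

# Share of articles where GPT and the human coder agree
accuracy <- mean(checked$gpt_pred == checked$hand_code)

# Cross-tabulate to see which kind of error is more common
confusion <- count(checked, hand_code, gpt_pred)

accuracy
confusion
```

The cross-tabulation matters as much as the headline accuracy. Missing relevant articles (hand code "Yes", GPT "No") and keeping irrelevant ones have different costs for an event dataset.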
