Note: analyses are based on code from Steve Rathje

Load Libraries and API Key

In order to call ChatGPT, you’ll need to generate an API key. To do that, log into OpenAI Platform and click on “API Keys.” Select “Create a new secret key” and copy the code generated. That will be the key you enter here.

library(httr)
library(stringr)
library(dplyr)
library(svMisc)
library(purrr)
library(tidyverse)

my_API <- "YOUR_API_KEY_HERE" # paste your own secret key; never share or publish it
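
Typing a secret key directly into a script makes it easy to leak when the file is shared or knitted. One common alternative, sketched below, is to store the key in an environment variable and read it with `Sys.getenv()`. The variable name `OPENAI_API_KEY` and the helper `get_api_key()` are assumptions for illustration; they are not part of the original code.

```r
# Hypothetical helper: read the API key from an environment variable
# instead of typing it into the script.
get_api_key <- function(var = "OPENAI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}

# Example: set the variable for this session only (normally it would live
# in ~/.Renviron), then read it back. The value is a dummy, not a real key.
Sys.setenv(OPENAI_API_KEY = "sk-demo-not-a-real-key")
my_API_demo <- get_api_key()
```

In a real session you would assign the result to `my_API` so the functions below pick it up.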

Example function

Here, we’re checking to see if we’ve successfully connected to GPT by asking GPT why NYC pizza tastes better. Note that this might take a minute.

chatGPT_response <- POST(
  # OpenAI chat completions API endpoint (you can copy and paste this)
  url = "https://api.openai.com/v1/chat/completions",
  # Authorize the request with your API key
  add_headers(Authorization = paste("Bearer", my_API)),
  # Tell the server that the request body is JSON
  content_type_json(),
  # Encode the body list below as JSON
  encode = "json",
  # The request itself: which model to use and the message(s) to send
  body = list(
    model = "gpt-4-0125-preview", # GPT model
    messages = list(list(role = "user", content = "Why does NYC pizza taste better?"))
  )
)

# Selecting the portion we want to display
answer_one <- content(chatGPT_response)$choices[[1]]$message$content
# cleaning the selected output
answer_one <- stringr::str_trim(answer_one)
# Printing the message as a character string
cat(answer_one)
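
If the request fails (for example, because of an invalid key or a rate limit), the parsed response has no `choices` element and the extraction step above quietly returns nothing. A minimal sketch of a more defensive extraction follows. `extract_answer()` is a hypothetical helper that works on the already-parsed list (what `content()` returns). The mock lists only imitate the response shape used above, plus an `error$message` field that is assumed to be present on failures.

```r
# Hypothetical helper: pull the answer text out of a parsed response list,
# returning NA (with a warning) when the expected fields are missing.
extract_answer <- function(parsed) {
  answer <- tryCatch(
    parsed$choices[[1]]$message$content,
    error = function(e) NULL
  )
  if (is.null(answer) || length(answer) == 0) {
    msg <- parsed$error$message
    detail <- if (is.null(msg)) "" else paste0(": ", msg)
    warning("No answer in response", detail, call. = FALSE)
    return(NA_character_)
  }
  trimws(answer)
}

# Mock parsed responses (made up for illustration, no API call involved)
ok  <- list(choices = list(list(message = list(content = "  Hello  "))))
bad <- list(error = list(message = "Incorrect API key provided"))

extract_answer(ok)
suppressWarnings(extract_answer(bad))
```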

Using GPT to code a dataset

To run that same process through our data, we’re going to combine those two steps (sending the request and cleaning the answer) into one function.

# Asking Questions to ChatGPT, Saving and Cleaning Answer
hey_chatGPT <- function(answer_my_question) {
  chat_GPT_answer <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", my_API)),
    content_type_json(),
    encode = "json",
    body = list(
      model = "gpt-4-0125-preview",
      messages = list(
        list(
          role = "user",
          content = answer_my_question
        )
      )
    )
  )
  str_trim(content(chat_GPT_answer)$choices[[1]]$message$content)
}

Test our new function

response <- hey_chatGPT("What are the differences between R and Python?")
cat(response)

Real analyses

Because I want to use GPT for coding (that is, rating text), I’m setting temperature to 0. The higher the temperature, the more random (or “creative”) GPT’s output is allowed to be. That can be helpful for open-ended generation, but it hasn’t worked well for my purposes.

Feel free to play with this. Some people prefer 0.1-0.2.
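
To build intuition for what temperature does, here is a simplified sketch: the model’s scores for candidate next tokens are divided by the temperature before being turned into probabilities, so low temperatures concentrate probability on the top candidate. The scores and the `softmax_temp()` helper are made up for illustration. The actual sampling inside OpenAI’s models is not exposed and may differ in detail.

```r
# Hypothetical helper: turn raw scores into probabilities at a given temperature.
softmax_temp <- function(scores, temperature) {
  if (temperature == 0) {
    # Limit case: all probability goes to the highest score
    p <- as.numeric(scores == max(scores))
    return(p / sum(p))
  }
  z <- scores / temperature
  z <- z - max(z)  # subtract the max for numerical stability
  exp(z) / sum(exp(z))
}

scores <- c(a = 2.0, b = 1.0, c = 0.5)  # made-up scores for three tokens

round(softmax_temp(scores, 1.0), 2)  # probabilities spread across tokens
round(softmax_temp(scores, 0.2), 2)  # most probability on token "a"
softmax_temp(scores, 0)              # all probability on token "a"
```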

chatGPT_0temp <- function(answer_my_question) {
  chat_GPT_answer <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", my_API)),
    content_type_json(),
    encode = "json",
    body = list(
      model = "gpt-4-0125-preview",
      temperature = 0,  # specifying 0 temp to limit gpt "creativity"
      messages = list(
        list(
          role = "user",
          content = answer_my_question
        )
      )
    )
  )
  str_trim(content(chat_GPT_answer)$choices[[1]]$message$content)
}

Load the dataframe

For GPT to enter new values into your dataframe, you’ll first need to create an empty column for it to fill. I’m making three new columns so I can compare GPT’s coding against itself across three runs. You can also make multiple columns if you want GPT to give you multiple responses within one dataframe.

df <- read.csv("/Users/kareenadelrosario/Desktop/morality study/miis_text_analysis/miis_clean_df_withallnegotiationoutcomes.csv")

data <- df %>% 
  transmute(Dyad.ID, text=negotiationLog) # selecting the two necessary variables and renaming the text column

# Create three empty "gpt" columns, one per run
data$gpt <- NA
data$gpt2 <- NA
data$gpt3 <- NA

Coding with custom function

First I’m going to do a test run with a subset of my data.

# let's limit this to 30 dyads
data_30 <- data %>% 
  dplyr::slice_min(order_by = Dyad.ID, n=60) %>% 
  distinct(Dyad.ID, .keep_all = TRUE)


# coding for persuasiveness - gpt 1
for (i in 1:nrow(data_30)) {
  print(i)
  question <- "Rate persuasiveness of only the lines that start with 'A:' in the negotiation on a 1 to 7 scale. Each line starts with <br>. Answer ONLY with a number. If there are no lines that start with 'A:' write 0. To help rate from 1-7, use these guidelines:
1 out of 7 would be something like, oh I think we should go for this offer (no reasoning/argument)
4 out of 7 would be the participant restating what's on their individual role sheet. Ex: I like the global firm because since they're international, they can look at applicants from all over the world.
7 out of 7 would be the participant having their own reasoning, argument, etc. A 7 should make you think, oh that's a good argument. 
Dont be afraid to use all the scale numbers -- please dont shy away from 1 and 7.  Here is the text:"
  text <- data_30[i, 2]  # Assuming the text is in the second column of the df
  concat <- paste(question, text)
  result <- chatGPT_0temp(concat)
  # retry for as long as the API returns an empty answer
  while (length(result) == 0) {
    result <- chatGPT_0temp(concat)
    print(result)
  }
  print(result)
  data_30$gpt[i] <- result # make sure this matches your df$variable
}

# gpt 2
for (i in 1:nrow(data_30)) {
  print(i)
  question <- "Rate persuasiveness of only the lines that start with 'A:' in the negotiation on a 1 to 7 scale. Each line starts with <br>. Answer ONLY with a number. If there are no lines that start with 'A:' write 0. To help rate from 1-7, use these guidelines:
1 out of 7 would be something like, oh I think we should go for this offer (no reasoning/argument)
4 out of 7 would be the participant restating what's on their individual role sheet. Ex: I like the global firm because since they're international, they can look at applicants from all over the world.
7 out of 7 would be the participant having their own reasoning, argument, etc. A 7 should make you think, oh that's a good argument. 
Dont be afraid to use all the scale numbers -- please dont shy away from 1 and 7.  Here is the text:"
  text <- data_30[i, 2]  # Assuming the text is in the second column of the df
  concat <- paste(question, text)
  result <- chatGPT_0temp(concat)
  # retry for as long as the API returns an empty answer
  while (length(result) == 0) {
    result <- chatGPT_0temp(concat)
    print(result)
  }
  print(result)
  data_30$gpt2[i] <- result # make sure this matches your df$variable
}

# gpt 3
for (i in 1:nrow(data_30)) {
  print(i)
  question <- "Rate persuasiveness of only the lines that start with 'A:' in the negotiation on a 1 to 7 scale. Each line starts with <br>. Answer ONLY with a number. If there are no lines that start with 'A:' write 0. To help rate from 1-7, use these guidelines:
1 out of 7 would be something like, oh I think we should go for this offer (no reasoning/argument)
4 out of 7 would be the participant restating what's on their individual role sheet. Ex: I like the global firm because since they're international, they can look at applicants from all over the world.
7 out of 7 would be the participant having their own reasoning, argument, etc. A 7 should make you think, oh that's a good argument. 
Dont be afraid to use all the scale numbers -- please dont shy away from 1 and 7.  Here is the text:"
  text <- data_30[i, 2]  # Assuming the text is in the second column of the df
  concat <- paste(question, text)
  result <- chatGPT_0temp(concat)
  # retry for as long as the API returns an empty answer
  while (length(result) == 0) {
    result <- chatGPT_0temp(concat)
    print(result)
  }
  print(result)
  data_30$gpt3[i] <- result # make sure this matches your df$variable
}

#write.csv(data_30, "gptdemo_data_30.csv", row.names = FALSE)
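
The three loops above differ only in the column they fill, and the `while` loop keeps retrying indefinitely if the API never returns an answer. One possible refactor, sketched here, wraps the loop in a function with a retry cap and checks that each answer is a whole number on the 0 to 7 scale. `rate_texts()`, `max_tries`, and the stub raters are hypothetical names. In real use you would pass `chatGPT_0temp` as `rater`.

```r
# Hypothetical helper: rate every text with `rater`, retrying empty or
# invalid answers up to `max_tries` times; returns NA when all tries fail.
rate_texts <- function(texts, question, rater, max_tries = 3) {
  ratings <- rep(NA_real_, length(texts))
  for (i in seq_along(texts)) {
    prompt <- paste(question, texts[i])
    for (attempt in seq_len(max_tries)) {
      result <- rater(prompt)
      value <- suppressWarnings(as.numeric(result))
      # accept only a single whole number between 0 and 7
      if (length(value) == 1 && !is.na(value) && value %in% 0:7) {
        ratings[i] <- value
        break
      }
    }
  }
  ratings
}

# Stubs standing in for chatGPT_0temp so the sketch runs without an API call
fake_rater  <- function(prompt) "4"           # always answers "4"
empty_rater <- function(prompt) character(0)  # always returns nothing

demo_texts <- c("A: take it", "A: take it because it pays more")
demo_ok    <- rate_texts(demo_texts, "Rate this:", fake_rater)
demo_empty <- rate_texts(demo_texts, "Rate this:", empty_rater)

# Real use would look something like this (not run here):
# data_30$gpt <- rate_texts(data_30$text, question, chatGPT_0temp)
```

Unlike the loops above, this sketch stores the ratings as numbers rather than strings, so the later `as.numeric()` conversion would not be needed.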

Let’s test the inter-rater reliability within ChatGPT

# remove this line in your analyses
data_30 <- read.csv("gptdemo_data_30.csv")

library(psych)

data_30 <- data_30 %>% 
   dplyr::mutate(across(starts_with("gpt"), ~as.numeric(as.character(.)))) 

icc_gpt <- ICC(data_30[,-1])  # Assuming the first column is `Dyad.ID`

print(icc_gpt)
Call: ICC(x = data_30[, -1])

Intraclass correlation coefficients 
                         type  ICC   F df1 df2       p lower bound upper bound
Single_raters_absolute   ICC1 0.68 7.4  29  60 4.0e-11        0.50        0.82
Single_random_raters     ICC2 0.68 7.5  29  58 4.8e-11        0.51        0.82
Single_fixed_raters      ICC3 0.68 7.5  29  58 4.8e-11        0.51        0.82
Average_raters_absolute ICC1k 0.87 7.4  29  60 4.0e-11        0.75        0.93
Average_random_raters   ICC2k 0.87 7.5  29  58 4.8e-11        0.75        0.93
Average_fixed_raters    ICC3k 0.87 7.5  29  58 4.8e-11        0.76        0.93

 Number of subjects = 30     Number of Judges =  3
See the help file for a discussion of the other 4 McGraw and Wong estimates,

Referencing ICC3 (single fixed raters), we see that the agreement among GPT’s three runs is ICC = 0.685, p < .001.
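
For readers who want to see what ICC3 measures, the classical Shrout and Fleiss formulation for a single fixed rater is ICC(3,1) = (MS_subjects - MS_error) / (MS_subjects + (k - 1) * MS_error), where k is the number of raters. Below is a base R sketch of that formula on made-up ratings with no missing values. `icc3_single()` is a hypothetical helper, and this is not a description of how `psych::ICC()` computes its estimates internally.

```r
# Hypothetical helper: ICC(3,1) from two-way ANOVA mean squares.
# Rows are subjects, columns are raters; assumes complete data.
icc3_single <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- ncol(x)
  grand <- mean(x)
  ss_subjects <- k * sum((rowMeans(x) - grand)^2)
  ss_raters   <- n * sum((colMeans(x) - grand)^2)
  ss_error    <- sum((x - grand)^2) - ss_subjects - ss_raters
  ms_subjects <- ss_subjects / (n - 1)
  ms_error    <- ss_error / ((n - 1) * (k - 1))
  (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)
}

# Made-up ratings: run2 is always one point above run1. ICC3 ignores a
# constant offset between raters, so consistency here is perfect.
toy_consistent <- cbind(run1 = c(1, 2, 3, 4), run2 = c(2, 3, 4, 5))
icc3_single(toy_consistent)

# Made-up ratings where the two runs swap the order of some subjects
toy_noisy <- cbind(run1 = c(1, 2, 3, 4), run2 = c(2, 1, 4, 3))
icc3_single(toy_noisy)
```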

Now, let’s compare it to humans.

Human vs Machine

Note that the human coder ratings are averaged across two coders. We’ll separate them for a different analysis later.

humancoders <- read.csv("humancoder_comparison_mar2024.csv") # human df

# we're still using a subset
comparison <- inner_join(humancoders, data_30, by = "Dyad.ID") %>% # combine df
  select(-c(X)) %>% 
  dplyr::mutate(across(starts_with("gpt"), ~as.numeric(as.character(.))))

icc_results <- ICC(comparison[,-1])  # Assuming the first column is `Dyad.ID`

print(icc_results)
Call: ICC(x = comparison[, -1])

Intraclass correlation coefficients 
                         type  ICC   F df1 df2       p lower bound upper bound
Single_raters_absolute   ICC1 0.57 6.4  29  90 5.5e-12        0.40        0.74
Single_random_raters     ICC2 0.58 6.9  29  87 1.3e-12        0.40        0.74
Single_fixed_raters      ICC3 0.59 6.9  29  87 1.3e-12        0.42        0.75
Average_raters_absolute ICC1k 0.84 6.4  29  90 5.5e-12        0.73        0.92
Average_random_raters   ICC2k 0.84 6.9  29  87 1.3e-12        0.73        0.92
Average_fixed_raters    ICC3k 0.85 6.9  29  87 1.3e-12        0.74        0.92

 Number of subjects = 30     Number of Judges =  4
See the help file for a discussion of the other 4 McGraw and Wong estimates,

Now let’s use the full dataset.

data_full <- data %>% 
  distinct(Dyad.ID, .keep_all = TRUE)

for (i in 1:nrow(data_full)) {
  print(i)
  question <- "Rate persuasiveness of only the lines that start with 'A:' in the negotiation on a 1 to 7 scale. Each line starts with <br>. Answer ONLY with a number. If there are no lines that start with 'A:' write 0. To help rate from 1-7, use these guidelines:
1 out of 7 would be something like, oh I think we should go for this offer (no reasoning/argument)
4 out of 7 would be the participant restating what's on their individual role sheet. Ex: I like the global firm because since they're international, they can look at applicants from all over the world.
7 out of 7 would be the participant having their own reasoning, argument, etc. A 7 should make you think, oh that's a good argument. 
Dont be afraid to use all the scale numbers -- please dont shy away from 1 and 7.  Here is the text:"
  text <- data_full[i, 2]  # Assuming the text is in the second column of the df
  concat <- paste(question, text)
  result <- chatGPT_0temp(concat)
  # retry for as long as the API returns an empty answer
  while (length(result) == 0) {
    result <- chatGPT_0temp(concat)
    print(result)
  }
  print(result)
  data_full$gpt[i] <- result # make sure this matches your df$variable
}
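
Since a full run makes one API call per row, it can help to cache the results on disk and only call the API when no saved file exists. A minimal sketch of that pattern follows. `load_or_run()` and the stand-in `expensive_step()` are hypothetical, and the temporary file path is only there so the example runs anywhere.

```r
# Hypothetical helper: read cached results if they exist, otherwise
# run the expensive step once and save its output.
load_or_run <- function(path, run) {
  if (file.exists(path)) {
    return(read.csv(path))
  }
  out <- run()
  write.csv(out, path, row.names = FALSE)
  out
}

# Stand-in for the rating loop; counts how often it is actually executed.
n_runs <- 0
expensive_step <- function() {
  n_runs <<- n_runs + 1
  data.frame(Dyad.ID = 1:3, gpt = c(4, 2, 7))  # made-up ratings
}

cache_file <- tempfile(fileext = ".csv")
first  <- load_or_run(cache_file, expensive_step)  # runs and saves
second <- load_or_run(cache_file, expensive_step)  # reads the saved file
```

With this pattern the same script works both for a fresh run and for knitting, without commenting lines in and out.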

# because I'm "knitting" this RMD, I don't want to totally rerun these analyses. I'm going to read in my output instead. Just know that I did actually run this.

# write.csv(data_full, "gptdemo_data_full.csv", row.names = FALSE)
# remove this line when you actually run the code.
data_full <- read.csv("gptdemo_data_full.csv")

# combine with humancoder data
comp_full <- inner_join(humancoders, data_full, by = "Dyad.ID") %>% # combine df
  select(-c(X, text, gpt2, gpt3)) %>% 
  dplyr::mutate(across(starts_with("gpt"), ~as.numeric(as.character(.))))

icc_full <- ICC(comp_full[,-1])  

print(icc_full)
Call: ICC(x = comp_full[, -1])

Intraclass correlation coefficients 
                         type  ICC   F df1 df2       p lower bound upper bound
Single_raters_absolute   ICC1 0.57 3.7 160 161 9.4e-16        0.46        0.67
Single_random_raters     ICC2 0.59 4.5 160 160 7.0e-20        0.40        0.71
Single_fixed_raters      ICC3 0.63 4.5 160 160 7.0e-20        0.53        0.72
Average_raters_absolute ICC1k 0.73 3.7 160 161 9.4e-16        0.63        0.80
Average_random_raters   ICC2k 0.74 4.5 160 160 7.0e-20        0.57        0.83
Average_fixed_raters    ICC3k 0.78 4.5 160 160 7.0e-20        0.69        0.84

 Number of subjects = 161     Number of Judges =  2
See the help file for a discussion of the other 4 McGraw and Wong estimates,

When looking at the ICC between GPT and the human coders across all of the data, we get ICC3 = 0.634, p < .001.

