Introduction

Large Language Models (LLMs) like GPT-4 have revolutionized how we communicate and understand information. In this activity, we’ll explore how to leverage LLMs in R for Data Science using the ellmer package.

We’ll start by experimenting with the OpenAI API to produce free-text and structured responses. Then we’ll use LLMs to answer, and grade, textual questions.

If you’d like to learn more about LLMs in R, check out the documentation for the excellent ellmer package that we will be using. It was written by Hadley Wickham, who created the tidyverse.

Objectives:

Prerequisites

0. Setup

Download the Reasoning 20k dataset and place the combined_reasoning.json file somewhere you can find it.

# Install and load required packages
# install.packages("ellmer");
# install.packages("kableExtra")
# install.packages("jsonlite")


library(ellmer);
library(tidyverse);
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(furrr); # for parallel map
## Loading required package: future
library(kableExtra);
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(jsonlite)
## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:purrr':
## 
##     flatten

Utility function to format tables nicely:

# Function to truncate text in all string (character) columns of a data frame
format_table <- function(df, max_length = 150) {
  head_df <- data.frame(df %>% head(5))
  # Function to truncate individual text entries
  truncate_text <- function(text, max_length) {
    text <- gsub("[\r\n]", "", text)
    return (
      ifelse(nchar(text) > max_length, 
             paste0(substr(text, 1, max_length), "..."), 
             text)
    )
  }
  
  # Loop over all columns that are character type and apply truncation
  for (col in colnames(head_df)) {
    if (is.character(head_df[[col]])) {
      head_df[[col]] <- sapply(head_df[[col]], truncate_text, max_length = max_length)
    }
  }
  
  # Return the modified data frame
  head_df %>%
    kbl() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size=12)
}
# Set your OpenAI API key. Shilad or your instructor will give this to you.
# Sys.setenv(OPENAI_API_KEY = "YOUR API KEY")
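Before making any calls, it can help to confirm that the key is actually visible to R. The sketch below assumes the key lives in the OPENAI_API_KEY environment variable, as in the commented line above. The helper name has_openai_key is hypothetical and is not part of ellmer.

```r
# Hypothetical helper (not part of ellmer): TRUE if the OpenAI key is set.
has_openai_key <- function() {
  # Sys.getenv() returns "" when the variable is unset
  nzchar(Sys.getenv("OPENAI_API_KEY"))
}

if (!has_openai_key()) {
  message("OPENAI_API_KEY is not set; set it with Sys.setenv() or in ~/.Renviron.")
}
```

If the message appears, set the key and restart your R session before continuing.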

1. Experimenting with the OpenAI API

Our interaction with the LLM will be through the OpenAI Chat Completions API. This API allows us to interact with the LLM by providing prompts and receiving completions. As of 2024, this API is by far the most popular way to interact with LLMs. In this activity we will use this API to ask questions, generate text, and even grade responses.

Simple text completion

To begin, let’s start with a simple question answering task using the gpt-4.1-mini model. We are going to encapsulate the question answering logic in a function called simple_completion. It sends a prompt to the LLM and returns the answer.

simple_completion <- function(prompt) {
  chat <- chat_openai(model = "gpt-4.1-mini");
  return (chat$chat(prompt, echo = FALSE));
}

Task 1: Play with simple text completion and reflect on results

Experiment with the function by asking a simple question as shown in the example below. Then replace the question with several of your own, on topics that interest you or that you know well. How does it perform? What does it do well? What does it do poorly? Put your example questions and analysis in the code below.

# Ask a few questions. Do you notice any differences from ChatGPT responses?
simple_completion("Why is the sky blue?");
## The sky appears blue because of a phenomenon called **Rayleigh scattering**. 
## When sunlight enters Earth's atmosphere, it is made up of different colors of 
## light that correspond to different wavelengths. Blue light has a shorter 
## wavelength compared to other colors like red or yellow.
## 
## As sunlight passes through the atmosphere, the shorter blue wavelengths are 
## scattered in all directions by the gases and particles in the air much more 
## than the longer wavelengths. This scattered blue light is what we see when we 
## look up during the day, making the sky appear blue.
## 
## In contrast, during sunrise or sunset, the sun's light passes through a thicker
## layer of the atmosphere, which scatters the shorter wavelengths out of direct 
## view and allows the longer wavelengths like red and orange to dominate, giving 
## the sky its reddish hues at those times.

The system prompt vs the user prompt

The text we provided above is called a “user prompt” because it is the text that the user provides to the LLM. In addition, the LLM can also be given a “system prompt” which is a special prompt that sets the higher-level context for the LLM.

Typically the system prompt is used to set the role of the LLM. For example, we can tell the LLM that it is an expert in a particular field, or that it should respond in a particular style, or that it has a specific set of rules to follow.

The user prompt then contains the specific question or task that the user wants the LLM to perform. Let’s introduce a new function called completion that uses both a system prompt and a user prompt.

completion <- function(system_prompt, user_prompt, model = "gpt-4.1-mini", echo = FALSE) {
  chat <- chat_openai(system_prompt, model = model);
  return (chat$chat(user_prompt, echo = echo));
}
completion("Answer briefly like a Highland Coo", "Why is the sky blue?");
## Sky blue 'cause wee light from sun scatters in air, blue scatters most like wee
## dance!
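Conceptually, the Chat Completions API receives the two prompts as a list of messages, each tagged with a role. The sketch below builds such a payload by hand with jsonlite, purely for illustration. ellmer constructs the real request for you, and the request it actually sends may differ or contain additional fields.

```r
library(jsonlite)

# Illustrative request body: one "system" message followed by one "user" message.
payload <- list(
  model = "gpt-4.1-mini",
  messages = list(
    list(role = "system", content = "Answer briefly like a Highland Coo"),
    list(role = "user", content = "Why is the sky blue?")
  )
)

# auto_unbox = TRUE writes length-1 vectors as JSON scalars rather than arrays
payload_json <- toJSON(payload, auto_unbox = TRUE, pretty = TRUE)
cat(payload_json)
```

The system message comes first, so it frames how the model reads the user message that follows.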

Task 2: Experiment with the addition of a system prompt

Experiment with the addition of a system prompt. How does it change the responses? Try a few different system prompts. What do you notice?

Comparing Models

Each provider of LLMs has a variety of models that are optimized for different tasks. For example, some models are optimized for speed, while others are optimized for accuracy. Some models are optimized for specific tasks like code generation or summarization.

If you would like to compare tradeoffs for some of the models available across different providers (OpenAI, Google, etc) you can use ArtificialAnalysis.ai.

To see all the models you have enabled with your OpenAI API key, you can use the following code:

format_table(models_openai());
   id            created_at  owned_by  cached_input  input  output
4  gpt-5-mini    2025-08-05  system           0.025   0.25     2.0
5  gpt-5-nano    2025-08-05  system           0.005   0.05     0.4
2  gpt-4.1-mini  2025-04-10  system           0.100   0.40     1.6
3  gpt-4.1-nano  2025-04-10  system           0.025   0.10     0.4
1  gpt-4o-mini   2024-07-16  system           0.075   0.15     0.6

The id is the model name. The cached_input, input, and output columns give the cost in US dollars per million tokens, where 1000 tokens is roughly 750 words. The input column is the cost of the prompt you send to the LLM, the cached_input column is the (lower) cost of prompt tokens that the provider has already cached, and the output column is the cost of the response you receive from the LLM.
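To make the pricing concrete, here is a minimal sketch that turns token counts into an estimated cost. The function names estimate_cost and words_to_tokens are made up for this example, the default prices are the gpt-4.1-mini row from the table above, and the words-to-tokens conversion is only the rough rule of thumb stated in the text.

```r
# Estimate the cost of one request from token counts and
# per-million-token prices (defaults: gpt-4.1-mini row above).
estimate_cost <- function(input_tokens, output_tokens,
                          input_price = 0.40, output_price = 1.60) {
  (input_tokens * input_price + output_tokens * output_price) / 1e6
}

# Rule of thumb from the text: 1000 tokens is roughly 750 words.
words_to_tokens <- function(words) words * 1000 / 750

# A 300-word prompt with a 150-word reply
estimate_cost(words_to_tokens(300), words_to_tokens(150))
```

A single short question costs a small fraction of a cent, but the cost scales linearly with the number of rows you send through the model.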

Task 3: Experiment with different models

Try asking your question with a different model. Try gpt-5-mini in particular, which is a newer reasoning model. How does it compare to gpt-4.1-mini?

completion("Answer briefly like a Highland Coo", 
           "Why is the sky blue?",
           model = "gpt-5-mini");
## Och, it's because wee air molecules scatter the sun's light — they scatter the 
## shorter blue wavelengths much more than the reds. So blue light gets sent all 
## ower the sky tae yer eyes. At dawn and dusk the light travels further through 
## air, the blue gets scattered away and the reds come through.

Warning: Do not use gpt-5-mini by default because it is expensive and slow!

Structured text generation

For programmatic use, it’s often helpful to have the LLM return structured responses, from which we can extract several different pieces of information as typed fields.

json_completion <- function(system_prompt, user_prompt, type_obj, model = "gpt-4.1-mini", echo = FALSE) {
  chat <- chat_openai(system_prompt, model = model);
  return (chat$chat_structured(user_prompt, type = type_obj, echo = echo));
}

Below you can find an example of calling this structured completion. The return value will be a named list with one element per field: response$attendees, response$when, and so on.

response <- json_completion(
    # The system prompt, with overall instructions
    system_prompt = "
    You are an expert calendar assistant.
    Your task is to extract a structured calendar invite by analyzing a short text.
    ",
    
    # The user prompt, with the specific text to analyze
    user_prompt = "
    Extract structured calendar data for the following text:
    
    Matthew and Ellen should meet Sunday at 4pm to discuss the future of the budget.
    ",
    
    # The type of the structured response we want
    type_obj = type_object(
      attendees = type_array(type_string(), "A list of strings with attendee names."),
      when = type_string("The starting date and time for the calendar event in ISO 8601 format."),
      subject = type_string("Short description of the event"),
      description = type_string("Detailed few-sentence description of the event.")
    )
  )
response
## $attendees
## [1] "Matthew" "Ellen"  
## 
## $when
## [1] "2024-06-16T16:00:00"
## 
## $subject
## [1] "Meeting to discuss the future of the budget"
## 
## $description
## [1] "Matthew and Ellen will meet on Sunday at 4pm to discuss the future of the budget."
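Because the response is an ordinary R list, its fields can feed directly into further analysis. The sketch below re-creates the response shown above by hand (so it runs without an API call), parses the ISO 8601 timestamp, and builds a small data frame. The names example_response and invite_df are made up for this example, and the UTC time zone is an assumption because the model's output did not include one.

```r
# The response shown above, re-created by hand so this runs without an API call.
example_response <- list(
  attendees = c("Matthew", "Ellen"),
  when = "2024-06-16T16:00:00",
  subject = "Meeting to discuss the future of the budget"
)

# Parse the ISO 8601 string; the time zone is an assumption because the
# model's output did not include one.
start <- as.POSIXct(example_response$when,
                    format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# One row per attendee; the length-1 fields are recycled across rows
invite_df <- data.frame(
  attendee = example_response$attendees,
  start = start,
  subject = example_response$subject
)
invite_df
```

Note that the model had to guess a concrete date for "Sunday", so values like this are worth validating before you rely on them.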

Task 4: Play with structured text generation and reflect on results

# Create your own example task (not a calendar invite) that produces structured output
# using the example above. Push the LLM with a hard example. Does it get it correct?

response <- json_completion(
  system_prompt = "...",
  
  user_prompt = "....",
  type_obj = type_object(
    ...
  )
)
response
# In your comments, reflect on: How might you use this for Data Science purposes for your project specifically?

2. Loading and exploring the Q&A dataset

In this assignment we will answer questions from the Reasoning 20k dataset. The dataset contains a set of challenging factual questions along with their answers. We will load the dataset, filter example questions, and then ask the LLM to answer them. We chose this dataset because it was created in October 2024, which makes it less likely that the LLM can “cheat” by having been trained on this data.

Load the Reasoning 20k Dataset

Download the JSON dataset to your computer and read it into a variable named reasoning_20k_df using code similar to the following:

reasoning_20k_df <- 
  as.data.frame(fromJSON("~/Downloads/combined_reasoning.json")) %>%
  select(user, assistant) %>%
  rename(question = user, answer = assistant) %>%
  mutate(id = row_number()) %>%
  relocate(id, .before = question);

format_table(reasoning_20k_df)
id question answer
1 Prove that the difference between two consecutive cubes cannot be divisible by 5, using the fact that the only possible remainders when a cube is divi… Let the two consecutive cubes be \(n^3\) and \((n+1)^3\). Their difference is:\[(n+1)^3 - n^3 = 3n^2 + 3n + 1.\]When \(n^3\) is divided by 5, the possible r…
2 How can I integrate the function \(\arcsin(\sqrt{x+1}-\sqrt{x})\)? Is there an easier way than using the formula $\int f(x)\,dx=x f(x)-\int_{…
3 Given the expression \(\frac{x^3+x+1}{(x^2+1)^2}\), decompose it into partial fractions. The decomposition of \(\frac{x^3+x+1}{(x^2+1)^2}\) can be directly observed as \(\dfrac{x}{x^2+1}+\dfrac{1}{(x^2+1)^2}\). This is because \(x^3+x\) can be f…
4 Is it true that a man named Mûrasi from India is 179 years old, as claimed by certain sources?Sources:- eface India- News Origin- World News Daily Rep… No, this claim is not accurate. The source of this information, the World News Daily Report, is known to publish fake news. They claim that Mûrasi was…
5 Find an example of a linear operator whose norm is not equal to the norm of its inverse. Consider the linear operator T from \((\mathbb{R}^2, \|\cdot\|_{sup})\) to \((\mathbb{R}^2, \|\cdot\|_1)\) defined by \(T(x,y) = (y,x)\). The norm of T is 1…
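If you want to see what the loading pipeline does before pointing it at the full file, the sketch below runs the same steps on two made-up records. It assumes the JSON is an array of objects with user and assistant fields, which is inferred from the select() call above rather than checked against the real file. The names toy_json and toy_df are made up for this example.

```r
library(jsonlite)
library(dplyr)

# Two made-up records in the shape the pipeline above expects:
# an array of objects with `user` and `assistant` fields (an assumption).
toy_json <- '[
  {"user": "What is 2 + 2?", "assistant": "4"},
  {"user": "Name a primary color.", "assistant": "Red"}
]'

# Same steps as above: parse, keep two columns, rename, add a row id
toy_df <- as.data.frame(fromJSON(toy_json)) %>%
  select(user, assistant) %>%
  rename(question = user, answer = assistant) %>%
  mutate(id = row_number()) %>%
  relocate(id, .before = question)

str(toy_df)
```

The id column is just the row number, which is why the ids you pick later only make sense for this exact version of the file.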

Task 5: Pick interesting questions

Now that we have the dataset loaded, pick some interesting questions to ask the LLM. Open the dataset in RStudio’s built-in data viewer and use its search field to find 5-10 questions that interest you. Write down their ids and create a dataframe called interesting_df that contains just those questions.

# These are questions that interest Shilad related to Music theory. 
# Pick ones that interest you. Locate them using the search function built in RStudio dataset viewer.
question_ids <- c(8248, 14377, 7769, 7311, 2568);
interesting_df <- reasoning_20k_df %>% filter(id %in% question_ids);
format_table(interesting_df)
id question answer
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations…
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals.
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to…
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ (\frac{3}{2})^m = (\frac{1}{2})^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di…
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings …

Carefully read through the questions and answers, and make sure they are accurate and make sense!

3. Create a dataframe with predicted answers

Below is a helper function called parallel_json_completion. It is similar to json_completion, but it takes a vector of user prompts (for example, a column of questions from a dataframe) and sends each one to the LLM in parallel.

parallel_json_completion <- function(system_prompt, user_prompts, type_obj, model = "gpt-4.1-mini") {
  chat <- chat_openai(system_prompt, model = model);
  return (parallel_chat_structured(chat, as.list(user_prompts), type = type_obj));
}

Below you can find an example of calling this function. The return value is a dataframe with one row per prompt and one column per field in the structured response (here, predicted). We then attach that column to interesting_df.

predicted_responses <- parallel_json_completion(
  system_prompt = "You are an expert. Answer the following questions.",
  user_prompts = interesting_df$question,
  type_obj = type_object(
    predicted = type_string("A very brief predicted answer to the question.")
  )
)
interesting_df$predicted = predicted_responses$predicted;
format_table(interesting_df);
id question answer predicted
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… The detuning occurs due to temperature-induced drift and component aging, which affect the transistors and capacitors in the astable multivibrator cir…
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals. 12
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… 321868920
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ ()^m = ()^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… Consider the equation (3/2)^m = (1/2)^n with positive integers m,n. Rewrite the right side as 2^{-n}, so (3/2)^m = 2^{-n}. This implies 3^m / 2^m = 2^…
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … Tuning musical instruments to 432Hz does not provide scientifically proven significant benefits to human well-being nor is it conclusively more “natur…

Task 6: Create a dataframe with predicted answers

Take a look at the predicted answers and compare them to the actual answers. What do you notice?
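For questions with a computable answer, one way to compare is to check the number directly in R. The sketch below does this for question 7769 (groups of 4 from 142 people), using the predicted value copied from the table above. It is a spot check for one question, not a general grading method.

```r
# Question 7769: how many groups of 4 can be formed from 142 people?
exact <- choose(142, 4)

# The model's predicted answer, copied from the table above
predicted <- 321868920

format(exact, big.mark = ",")
exact == predicted
```

Here the two values differ: choose(142, 4) is 16,234,505, while the predicted answer in the table is 321,868,920. Keep this in mind when you look at the grades later.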

4. Evaluate predicted answers

To evaluate the quality of responses, we would traditionally ask human experts to grade them. LLMs offer a new approach: the LLM-as-Judge paradigm uses an LLM to evaluate responses, providing feedback and a score based on correctness and completeness. This allows us to assess the performance of the LLM or other models over a set of questions.

Think to yourself: What are the costs and benefits of using LLM-as-judge vs humans? When may it make sense to use one vs. the other?

We need to start by creating a string representation of the question, answer, and predicted answer. The qa_evaluation_data will serve as the user_prompts for your LLM as judge.

qa_evaluation_data <- apply(
  interesting_df %>% select(question, answer, predicted),
  1, 
  \(row) toJSON(as.list(row), auto_unbox = TRUE, na = "null")
)

format_table(qa_evaluation_data)
df.....head.5.
{"question":"Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor valu…
{"question":"In an equal-tempered musical scale, how many intervals are there in an octave?","answer":"An octave in an equal-tempered musical scale is…
{"question":"In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people…
{"question":"In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multipl…
{"question":"Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more \"natural\" compared to the s…
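To see exactly what one of these strings contains, the sketch below serializes a single made-up row with the same toJSON call and then parses it back. The row values and the names example_row, row_json, and parsed are invented for illustration.

```r
library(jsonlite)

# A made-up row with the same three fields as interesting_df
example_row <- c(
  question = "What is 2 + 2?",
  answer = "4",
  predicted = "Four"
)

# Same call as in the apply() above: each named element becomes a JSON field
row_json <- toJSON(as.list(example_row), auto_unbox = TRUE, na = "null")
cat(row_json)

# Round trip back to an R list to confirm nothing was lost
parsed <- fromJSON(row_json)
parsed$predicted
```

Because the grader only sees this JSON string, the field names (question, answer, predicted) are the only clue it has about which text is the reference answer and which is the prediction.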

Task 7: Grade the predicted answers

Use the parallel_json_completion function to grade the quality of the answers you received. You should pass qa_evaluation_data as the user_prompts argument to the function. Your grades should be assigned on a scale of 1 (bad) - 10 (good).

Hint: Ask the grader to include feedback along with the score.

grades <- parallel_json_completion(
   system_prompt = "
        You are an expert grader.
        Evaluate the student's answer to the following question.
        Perform this task for the following question:
        ",
   user_prompts = qa_evaluation_data,
   type_obj = type_object(
     feedback = type_string("A brief summary of feedback on the correctness of the student's answer."),
     score = type_number("A score out of 10 based on the quality of the student's response.")
   )
)
format_table(grades)
feedback score
The student’s answer correctly identifies multiple relevant factors that contribute to detuning in a chain of astable multivibrator circuits, includin… 9
The student’s answer correctly states that an octave in an equal-tempered musical scale is divided into 12 intervals (semitones). This is the accurate… 10
The student’s answer correctly explains the use of the combination formula and demonstrates the calculation steps accurately, including reducing facto… 10
The student’s answer correctly analyzes the equation by expressing it in terms of prime factorizations and invoking the Fundamental Theorem of Arithme… 9
The answer correctly states that there is insufficient scientific evidence supporting significant benefits of tuning instruments to 432Hz over the sta… 9
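Once you have scores, a quick summary helps you decide which answers to inspect by hand. The sketch below copies the five scores from the table above into a small data frame so it runs without an API call. It assumes the grades come back in the same order as the rows of interesting_df, and the review cutoff of 8 is arbitrary. The name grades_example is made up for this example.

```r
# Scores copied from the table above so this runs without an API call.
# Assumes grades are returned in the same order as interesting_df.
grades_example <- data.frame(
  id = c(2568, 7311, 7769, 8248, 14377),
  score = c(9, 10, 10, 9, 9)
)

mean(grades_example$score)
range(grades_example$score)

# Flag answers below a threshold for manual review; the cutoff of 8 is arbitrary
needs_review <- subset(grades_example, score < 8)
nrow(needs_review)
```

A high average is not proof of correctness, so it is still worth checking a few graded rows against the reference answers yourself.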

Conclusion

Through this activity, you’ve learned how to:

Understanding how to use LLMs for grading can help in assessing model performance and automating evaluation tasks.
