Introduction

Large Language Models (LLMs) like GPT-4 have revolutionized how we communicate and understand information. In this activity, we’ll explore how to leverage LLMs in R for Data Science using the openai package.

We’ll start by playing with the OpenAI API for free-text and structured responses. Then, we’ll use LLMs to answer textual questions and to grade those answers.

Objectives:

Prerequisites

Setup

Download the Reasoning 20k dataset and place the combined_reasoning.json file somewhere you can find it.

# Install and load required packages
# install.packages("openai");
# install.packages("httr");
# install.packages("jsonlite");
# install.packages("furrr");
# install.packages("kableExtra")

library(openai);
library(httr);
## 
## Attaching package: 'httr'
## The following object is masked from 'package:openai':
## 
##     upload_file
library(jsonlite);
library(tidyverse);
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ purrr::flatten()    masks jsonlite::flatten()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ httr::upload_file() masks openai::upload_file()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(furrr); # for parallel map
## Loading required package: future
library(kableExtra);
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

Utility function to format tables nicely:

# Function to truncate text in all string (character) columns of a data frame
format_table <- function(df, max_length = 150) {
  head_df <- data.frame(df %>% head(5))
  # Function to truncate individual text entries
  truncate_text <- function(text, max_length) {
    text <- gsub("[\r\n]", "", text)
    return (
      ifelse(nchar(text) > max_length, 
             paste0(substr(text, 1, max_length), "..."), 
             text)
    )
  }
  
  # Loop over all columns that are character type and apply truncation
  for (col in colnames(head_df)) {
    if (is.character(head_df[[col]])) {
      head_df[[col]] <- sapply(head_df[[col]], truncate_text, max_length = max_length)
    }
  }
  
  # Return the modified data frame
  head_df %>%
    kbl() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size=12)
}
# Set your OpenAI API key. Shilad or your instructor will give this to you.
Sys.setenv(OPENAI_API_KEY = "YOUR API KEY")

1. Experimenting with the OpenAI API

Our interaction with the LLM will be through the OpenAI Chat Completions API. This API lets us send prompts and receive completions. As of 2024, it is one of the most widely used ways to interact with LLMs. In this activity we will use it to ask questions, generate text, and even grade responses.

Simple text completion

To begin, let’s try a simple question-answering task using the GPT-4o-mini model. We will encapsulate the question-answering logic in a function called completion, which sends a prompt to the LLM and returns its answer.

completion <- function(prompt, max_tokens = 100) {
  # Get the response from the LLM
  response <- openai::create_chat_completion(
    model = "gpt-4o-mini",
    messages = list(list(role = "system", content = prompt)),
    temperature = 0.1,
    max_tokens = max_tokens
  )
  
  # Return the response
  return (response$choices$message.content);
}

Task 1: Play with simple text completion and reflect on results

Experiment with the function by asking a simple question as shown in the example below. Change the question below to several that you are interested in or have expertise related to. How does it perform? What does it do well? What does it do poorly? Put your example questions and analysis in the code below.

# Ask a few questions. In comments answer: what does it do well? What does it do poorly?
completion("Why is the sky blue?", 200);
## [1] "The sky appears blue primarily due to a phenomenon called Rayleigh scattering. When sunlight enters the Earth's atmosphere, it is made up of different colors, each with varying wavelengths. Blue light has a shorter wavelength compared to other colors like red or yellow.\n\nAs sunlight passes through the atmosphere, it collides with air molecules and small particles. Because blue light is scattered in all directions more than other colors due to its shorter wavelength, we see a predominance of blue when we look up at the sky.\n\nDuring sunrise and sunset, the sky can appear red or orange because the sunlight has to pass through a thicker layer of the atmosphere. This longer path scatters the shorter blue wavelengths out of our line of sight, allowing the longer red wavelengths to dominate."

Structured text generation

For programmatic use, it’s often helpful to have the LLM return structured responses from which we can extract several different pieces of information.

json_completion <- function(prompt, max_tokens = 200) {
  # Get the response from the LLM
  response <- openai::create_chat_completion(
    model = "gpt-4o-mini",
    messages = list(list(role = "system", content = prompt)),
    temperature = 0.0,
    max_tokens = max_tokens
  )
  
  json <- response$choices$message.content;
  
  # Shilad: This is a hack to strip the Markdown code fence (three backticks followed by json)
  # that 4o-mini occasionally wraps around its JSON.
  # Ideally we would set response_format = {type: json_object} to avoid this, but it's not supported by the R OpenAI wrapper.
  pattern <- regex("```json(.*?)```", dotall = TRUE);
  if (str_detect(json, pattern)) {
    json <- str_match(json, pattern)[, 2]; # Extract the matched JSON content
  }
  
  return (fromJSON(json));
}
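
To see what the fence-stripping step does without calling the API, here is a minimal sketch that applies the same pattern to hard-coded replies. The helper name extract_json and both sample replies are assumptions made for illustration; they are not part of the openai package or of the activity's code.

```r
library(stringr)
library(jsonlite)

# Build the three-backtick fence programmatically to keep this example readable
fence <- strrep("`", 3)

# Hypothetical helper: the same clean-up logic as json_completion, minus the API call
extract_json <- function(text) {
  pattern <- regex(paste0(fence, "json(.*?)", fence), dotall = TRUE)
  if (str_detect(text, pattern)) {
    text <- str_match(text, pattern)[, 2]  # keep only the content between the fences
  }
  fromJSON(text)
}

# Made-up replies: one bare, one wrapped in a Markdown fence
bare_reply <- '{"subject": "Budget Discussion", "attendees": ["Matthew", "Ellen"]}'
fenced_reply <- paste0(fence, "json\n", bare_reply, "\n", fence)

from_fenced <- extract_json(fenced_reply)
from_bare <- extract_json(bare_reply)
from_fenced$attendees
```

Both replies parse to the same R list, so downstream code does not need to care whether the model added the fence.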

Below you can find an example of calling this structured completion. The return value will be an object with fields response$attendees, etc.

response <- json_completion("
Your task is to extract a structured calendar invite by analyzing a short text.

The return value should be a JSON object with the following fields:
- attendees: A list of strings with attendee names.
- when: The starting date and time for the calendar event
- subject: Short description of the event
- description: Detailed few-sentence description of the event.

Perform this task for the following text:

Matthew and Ellen should meet Sunday at 4pm to discuss the future of the budget.
")
response
## $attendees
## [1] "Matthew" "Ellen"  
## 
## $when
## [1] "2023-10-29T16:00:00"
## 
## $subject
## [1] "Budget Discussion"
## 
## $description
## [1] "Matthew and Ellen will meet to discuss the future of the budget. This meeting aims to address key financial strategies and planning for upcoming projects."
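
Because the result is an ordinary R list, its fields can feed directly into later analysis steps. As a small sketch, the when string from the output above can be converted to a date-time and checked against the source text, which said the meeting is on a Sunday at 4pm. The variable names and the choice of UTC are assumptions for illustration.

```r
# The `when` value copied from the example output above
when_text <- "2023-10-29T16:00:00"

# Parse the timestamp; UTC is assumed because the reply carries no time zone
when <- as.POSIXct(when_text, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# POSIXlt stores the weekday as 0 (Sunday) through 6 (Saturday)
is_sunday <- as.POSIXlt(when)$wday == 0
hour_of_day <- as.POSIXlt(when)$hour
```

Note that the input text never mentioned a year or a date, so the model had to guess one; a check like this confirms the weekday and hour but cannot confirm the date itself.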

Task 2: Play with structured text generation and reflect on results

# Create your own example task (not a calendar invite) that produces structured output
# using the example above. Push the LLM with a hard example. Does it get it correct?

response <- json_completion("
Your task is to ....
")
response
# In your comments, reflect on: How might you use this for Data Science purposes?

2. Loading and exploring the Q&A dataset

In this assignment we will answer questions from the Reasoning 20k dataset. The dataset contains a set of challenging factual questions along with their answers. We will load the dataset, filter example questions, and then interact with the LLM to answer the questions. We chose this dataset because it was created in October 2024, so the LLM could not have “cheated” by training on this data.

Load the Reasoning 20k Dataset

Download the JSON dataset to your computer and read it into a variable named reasoning_20k_df using code similar to the following:

reasoning_20k_df <- 
  as.data.frame(fromJSON("~/Downloads/combined_reasoning.json")) %>%
  select(user, assistant) %>%
  rename(question = user, answer = assistant) %>%
  mutate(id = row_number()) %>%
  relocate(id, .before = question);

format_table(reasoning_20k_df)
id question answer
1 Prove that the difference between two consecutive cubes cannot be divisible by 5, using the fact that the only possible remainders when a cube is divi… Let the two consecutive cubes be \(n^3\) and \((n+1)^3\). Their difference is:\[(n+1)^3 - n^3 = 3n^2 + 3n + 1.\]When \(n^3\) is divided by 5, the possible r…
2 How can I integrate the function \(\arcsin(\sqrt{x+1}-\sqrt{x})\)? Is there an easier way than using the formula $f(x),dx=x f(x)-_{…
3 Given the expression \(\frac{x^3+x+1}{(x^2+1)^2}\), decompose it into partial fractions. The decomposition of \(\frac{x^3+x+1}{(x^2+1)^2}\) can be directly observed as \(\dfrac{x}{x^2+1}+\dfrac{1}{(x^2+1)^2}\). This is because \(x^3+x\) can be f…
4 Is it true that a man named Mûrasi from India is 179 years old, as claimed by certain sources?Sources:- eface India- News Origin- World News Daily Rep… No, this claim is not accurate. The source of this information, the World News Daily Report, is known to publish fake news. They claim that Mûrasi was…
5 Find an example of a linear operator whose norm is not equal to the norm of its inverse. Consider the linear operator T from \((\mathbb{R}^2, \|\cdot\|_{sup})\) to \((\mathbb{R}^2, \|\cdot\|_1)\) defined by \(T(x,y) = (y,x)\). The norm of T is 1…
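
If you want to check the reshaping pipeline before pointing it at the full file, the following sketch runs the same steps on a tiny inline JSON string. The two records are made up, and the assumption that the file is an array of objects with user and assistant fields is inferred from the select() call above rather than from the dataset's documentation.

```r
library(jsonlite)
library(dplyr)

# Made-up stand-in for combined_reasoning.json (assumed shape: array of objects)
sample_json <- '[
  {"user": "What is 2 + 2?", "assistant": "4"},
  {"user": "Name a prime number.", "assistant": "7"}
]'

sample_df <-
  as.data.frame(fromJSON(sample_json)) %>%
  select(user, assistant) %>%                     # keep only the two text columns
  rename(question = user, answer = assistant) %>%
  mutate(id = row_number()) %>%                   # stable id for filtering later
  relocate(id, .before = question)

sample_df
```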

Task 3: Pick interesting questions

Now that we have the dataset loaded, pick some interesting questions to ask the LLM. Open the dataset in the built-in RStudio data viewer and use its search field to find 5-10 questions that interest you. Write down their ids and create a dataframe called interesting_df that contains just those questions.

# These are questions that interest Shilad related to music theory. 
# Pick ones that interest you. Locate them using the search function built into the RStudio data viewer.
question_ids <- c(8248, 14377, 7769, 7311, 2568);
interesting_df <- reasoning_20k_df %>% filter(id %in% question_ids);
format_table(interesting_df)
id question answer
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations…
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals.
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to…
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ ()^m = ()^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di…
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings …

Carefully read through the questions and answers, and make sure they are accurate and make sense!
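
Some reference answers can be checked by recomputing them. As a sketch, question 7769 asks how many groups of 4 can be formed from 142 people, which is the binomial coefficient \(\binom{142}{4}\); base R's choose function computes it directly.

```r
# Number of ways to choose 4 people from 142, ignoring order
groups <- choose(142, 4)

# Same value from the definition n! / (r! (n - r)!), with the common factor cancelled
by_hand <- prod(142:139) / factorial(4)

format(groups, big.mark = ",")
```

This gives 16,234,505, which is a useful number to keep in mind when reading the model's answers to this question later in the activity.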

3. The row-mapping helper function

Throughout this activity we are going to use a helper function called map_table_rows that applies a function to each row of a dataframe and returns a new dataframe with the original columns and the new columns. This function is useful for applying the LLM to each row of a dataset.

We are going to use the future_pmap function from the furrr package, a parallelized version of purrr’s pmap, to speed up the process.

map_table_rows <- function(df, mapping_function) {
  result <- future_pmap(df, mapping_function);
  return (cbind(df, bind_rows(result)));
}
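
To make the mechanics concrete, here is a sequential sketch of the same idea using purrr's pmap in place of future_pmap. Each row is passed to the mapper as named arguments, the mapper returns a named list, and the lists are stacked into new columns. The mapper diff_mapper and the small data frame are made up for illustration.

```r
library(purrr)
library(dplyr)

small_df <- data.frame(x = c(1, 2), y = c(5, 6), z = c(9, 12))

# Hypothetical mapper: picks the columns it needs by name; `...` absorbs the rest (here, y)
diff_mapper <- function(x, z, ...) {
  list(difference = z - x)
}

# One call per row: diff_mapper(x = 1, y = 5, z = 9), then diff_mapper(x = 2, y = 6, z = 12)
row_results <- pmap(small_df, diff_mapper)

# Stack the per-row lists into a table and attach it to the original columns
result <- cbind(small_df, bind_rows(row_results))
result
```

The `...` in the mapper's signature matters: without it, pmap would fail because it passes every column, including ones the mapper does not name.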

Task 4: Understand the map_table_rows example

Below is an example of how to use the map_table_rows function. Take a look at the code and add a comment explaining what it is doing.

Add a second example that adds the square of x + y to the dataframe.

# Add a comment below indicating exactly what is happening
df <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(5, 6, 7, 8),
  z = c(9, 10, 11, 12)
);

example_mapper <- function(x, y, ...) {
  return (list(
    sum = x + y,
    product = x * y
  ))
}

df %>% map_table_rows(example_mapper)
##   x y  z sum product
## 1 1 5  9   6       5
## 2 2 6 10   8      12
## 3 3 7 11  10      21
## 4 4 8 12  12      32
# Add a second example that adds the square of x + y to the dataframe.

To speed up this mapping, we will ask furrr to work on up to 10 rows in parallel. This means we will execute up to 10 LLM calls at the same time.

plan(multisession, workers = 10)
## Warning in checkNumberOfLocalWorkers(workers): Careful, you are setting up 10
## localhost parallel workers with only 8 CPU cores available for this R process
## (per 'system'), which could result in a 125% load. The soft limit is set to
## 100%. Overusing the CPUs has negative impact on the current R process, but also
## on all other processes of yours and others running on the same machine. See
## help("parallelly.options", package = "parallelly") for how to override the soft
## and hard limits

When you run this, depending on how many cores your laptop has, you may see a warning message indicating that this setting may saturate your CPU. Why is this unlikely when we use the parallel functions to interact with the OpenAI API?

4. Create a dataframe with predicted answers

Task 5: Create a dataframe with predicted answers

Create a new dataset called simple_answers that takes interesting_df and uses the map_table_rows helper to add a predicted column with the generated answer.

# Complete your implementation of the function below. 
# It should be a one-liner that calls the `completion` function.
# The name of the new column MUST be `predicted`
add_simple_answer <- function(question, ...) {
  return (list(predicted = completion(question)));
}

simple_answers <- 
  interesting_df %>%
  map_table_rows(add_simple_answer);

format_table(simple_answers)
id question answer predicted
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… A music synthesizer built with a chain of astable multivibrator circuits can experience detuning over time for several reasons, even when using fixed …
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals. In an equal-tempered musical scale, there are 12 intervals in an octave. These intervals are typically referred to as semitones or half steps. Each se…
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… To find the number of different groups of 4 that can be formed from 142 people, we can use the combination formula, which is given by:[C(n, r) = …
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ ()^m = ()^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… To prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples, we need to analyze the …
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … The debate over tuning musical instruments to 432Hz versus the standard 440Hz is a topic of interest among musicians, sound healers, and some wellness…

Take a look at the predicted answers and compare them to the actual answers. What do you notice?

5. Evaluate predicted answers

To evaluate the quality of responses, we would traditionally use human experts to grade them. The rise of LLMs offers a new approach: labeling datasets with an LLM. The LLM-as-Judge paradigm uses the LLM to evaluate responses, providing feedback and scoring based on correctness and completeness. This allows us to assess the performance of the LLM or other models over a set of questions.

Think to yourself: What are the costs and benefits of using LLM-as-judge vs humans? When may it make sense to use one vs. the other?

Below is code that takes a dataframe and returns the grades for each question:

grade_predicted_answer <- function(question, answer, predicted, ...) {
   prompt <- paste(
        "
        You are an expert grader.
        Evaluate the student's answer to the following question.
        Provide your response as a JSON object with the following attributes:
        - feedback: A brief summary of feedback on the correctness of the student's answer.
        - score: A score out of 10 based on the quality of the student's response.
        
        Perform this task for the following question:
        ",
        toJSON(
          list(question = question, answer = answer, predicted = predicted),
          auto_unbox = TRUE
        )
   );
   grading_response <- json_completion(prompt);
   return (list(
     score = grading_response$score,
     feedback = grading_response$feedback
   ));
};
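
It can help to look at the JSON payload that gets appended to the grading prompt. The sketch below builds it for one row, using the question and reference answer for id 7311 and the short predicted value "12" that appears in the Chain of Thought output later in this activity. It also shows why auto_unbox = TRUE is used.

```r
library(jsonlite)

# Values taken from question 7311 in this activity
payload <- toJSON(
  list(
    question = "In an equal-tempered musical scale, how many intervals are there in an octave?",
    answer = "An octave in an equal-tempered musical scale is divided into 12 intervals.",
    predicted = "12"
  ),
  auto_unbox = TRUE  # emit length-one vectors as plain JSON strings
)
cat(payload)

# Without auto_unbox, each value would be wrapped in a one-element array
boxed <- toJSON(list(predicted = "12"))
cat(boxed)
```

The unboxed form reads as three plain strings, which is closer to what a human grader would expect to see and leaves less room for the model to misread the structure.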

Task 6: Grade the simple answers

Use the function above along with map_table_rows to assign grades to each predicted answer. In comments, reflect on whether the scores and feedback make sense to you.

# Use map_table_rows and grade_predicted_answer to assign grades to simple_answers
simple_answers %>% 
  map_table_rows(grade_predicted_answer) %>%
  format_table()
id question answer predicted score feedback
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… A music synthesizer built with a chain of astable multivibrator circuits can experience detuning over time for several reasons, even when using fixed … 8 The student’s answer provides a comprehensive explanation of the factors contributing to detuning in a music synthesizer built with astable multivibra…
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals. In an equal-tempered musical scale, there are 12 intervals in an octave. These intervals are typically referred to as semitones or half steps. Each se… 8 The student’s answer is correct in stating that an octave in an equal-tempered musical scale is divided into 12 intervals. However, it could be improv…
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… To find the number of different groups of 4 that can be formed from 142 people, we can use the combination formula, which is given by:[C(n, r) = … 8 The student correctly identifies the use of the combination formula and provides a clear explanation of the calculation process. However, the final an…
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ ()^m = ()^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… To prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples, we need to analyze the … 6 The student’s answer correctly identifies the equation and attempts to prove that there are no solutions for positive integers m and n. However, the e…
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … The debate over tuning musical instruments to 432Hz versus the standard 440Hz is a topic of interest among musicians, sound healers, and some wellness… 8 The student’s answer effectively addresses the question by highlighting the lack of scientific evidence supporting the benefits of 432Hz tuning over 4…
# Look at the scores and feedback. Do they make sense to you?

6. Create Chain of Thought answers

Chain of Thought is a prompting technique where you ask the LLM to work through the problem in structured steps. This can be useful for generating more detailed answers or exploring a topic in depth.

Task 7: Implementing Chain of Thought Prompts

Apply the pattern you see in the grading function to create a new data frame called cot_answers (for “Chain of Thought”).

* Use the json_completion function to generate a more detailed response to each question. You will need to raise the second argument from the default number of tokens to something higher (e.g. 1000).
* You should ask the LLM to produce a JSON object with the following fields:
  - plan: A step-by-step plan for solving the problem.
  - details: A detailed step-by-step solution to the problem.
  - answer: The final answer to the question.
* Note that the order of the fields in the JSON object is important. You must force the LLM to generate output in the order it “thinks.”

add_cot_answer <- function(question, ...) {
   prompt <- paste(
      "
      Your task is to provide a detailed response to the following question.
      Generate a JSON object with the following fields:
      - plan: A single string containing step-by-step plan for solving the problem.
      - details: A single string containing detailed step-by-step solution to the problem.
      - answer: The single string containing final answer to the question.
      
      Perform this task for the following question:
      ",
      question
   );
   cot_response <- json_completion(prompt, max_tokens = 1000);
   return (list(
     plan = cot_response$plan,
     details = cot_response$details,
     predicted = cot_response$answer
   ));
};

cot_answers <- 
  interesting_df %>%
  map_table_rows(add_cot_answer);

format_table(cot_answers)
id question answer plan details predicted
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations…
  1. Understand the function of astable multivibrators in synthesizers. 2. Investigate the factors that can cause detuning in electronic circuits. 3. An…
  1. Astable multivibrators are circuits that generate a continuous square wave output, commonly used in synthesizers to create audio signals. They rely…
A music synthesizer built with a chain of astable multivibrator circuits detunes after a few hours due to temperature variations affecting resistor an…
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals.
  1. Define what an octave is in music theory. 2. Explain the concept of intervals in an equal-tempered scale. 3. Count the number of distinct intervals…
  1. An octave in music theory is the interval between one musical pitch and another with double its frequency. For example, if a note has a frequency o…
12
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to…
  1. Identify the total number of participants (142). 2. Use the combination formula to calculate the number of ways to choose 4 people from 142. 3. The…
  1. We have 142 participants. We need to find the number of ways to choose 4 participants from these 142. 2. The combination formula is C(n, r) = n! / …
16242880
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ ()^m = ()^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di…
  1. Understand the concept of just intonation and pure fifths. 2. Define the mathematical equation to be proven. 3. Analyze the left side of the equati…
  1. In just intonation, a pure fifth is represented by the ratio 3/2. Stacking m pure fifths can be expressed as (3/2)^m. 2. An octave is represented b…
Stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples.
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings …
  1. Research the history and context of 432Hz and 440Hz tuning. 2. Investigate scientific studies on the effects of different tuning frequencies on hum…
  1. The history of musical tuning reveals that 440Hz became the standard tuning frequency in the mid-20th century, while 432Hz has been associated with…
There is no substantial scientific evidence to support that tuning musical instruments to 432Hz provides significant benefits to human well-being comp…

Task 8: Grade Chain of Thought Prompts

Finally, use the same procedure you did earlier to grade the new responses. Do you see any interesting differences?

cot_answers %>% 
  map_table_rows(grade_predicted_answer) %>%
  format_table()
id question answer plan details predicted score feedback
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations…
  1. Understand the function of astable multivibrators in synthesizers. 2. Investigate the factors that can cause detuning in electronic circuits. 3. An…
  1. Astable multivibrators are circuits that generate a continuous square wave output, commonly used in synthesizers to create audio signals. They rely…
A music synthesizer built with a chain of astable multivibrator circuits detunes after a few hours due to temperature variations affecting resistor an… 8 The student’s answer provides a comprehensive explanation of the factors contributing to detuning in a music synthesizer, including power supply fluct…
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals.
  1. Define what an octave is in music theory. 2. Explain the concept of intervals in an equal-tempered scale. 3. Count the number of distinct intervals…
  1. An octave in music theory is the interval between one musical pitch and another with double its frequency. For example, if a note has a frequency o…
12 10 The student’s answer is correct and accurately states that there are 12 intervals in an octave in an equal-tempered musical scale.
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to…
  1. Identify the total number of participants (142). 2. Use the combination formula to calculate the number of ways to choose 4 people from 142. 3. The…
  1. We have 142 participants. We need to find the number of ways to choose 4 participants from these 142. 2. The combination formula is C(n, r) = n! / …
16242880 7 The student correctly explained the combination formula and applied it to the problem, but the final calculation of the number of groups is incorrect….
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ ()^m = ()^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di…
  1. Understand the concept of just intonation and pure fifths. 2. Define the mathematical equation to be proven. 3. Analyze the left side of the equati…
  1. In just intonation, a pure fifth is represented by the ratio 3/2. Stacking m pure fifths can be expressed as (3/2)^m. 2. An octave is represented b…
Stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. 7 The student’s answer correctly identifies the equation to be proven and provides a logical argument using the Fundamental Theorem of Arithmetic. Howev…
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings …
  1. Research the history and context of 432Hz and 440Hz tuning. 2. Investigate scientific studies on the effects of different tuning frequencies on hum…
  1. The history of musical tuning reveals that 440Hz became the standard tuning frequency in the mid-20th century, while 432Hz has been associated with…
There is no substantial scientific evidence to support that tuning musical instruments to 432Hz provides significant benefits to human well-being comp… 8 The student’s answer accurately addresses the question by highlighting the lack of scientific evidence supporting the benefits of 432Hz tuning over 44…
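
One quick way to look for differences is to put the two sets of scores side by side. The sketch below copies the scores from the two graded tables shown in this activity; they come from a single run on five questions, so treat any gap as anecdotal rather than as evidence that one prompting style is better.

```r
# Scores copied from the graded tables above (one run each)
scores <- data.frame(
  id     = c(2568, 7311, 7769, 8248, 14377),
  simple = c(8, 8, 8, 6, 8),
  cot    = c(8, 10, 7, 7, 8)
)

# Positive values mean the Chain of Thought answer was graded higher
scores$change <- scores$cot - scores$simple

mean_simple <- mean(scores$simple)
mean_cot <- mean(scores$cot)
scores
```

Here the means are 7.6 for the simple answers and 8 for the Chain of Thought answers. The per-question changes are mixed: question 7769 dropped a point, and its feedback says the final calculation is incorrect.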

7. Use RAG to ground answers

RAG (Retrieval-Augmented Generation) is a technique where you first retrieve documents relevant to the question (here, Wikipedia articles) and then include their text in the prompt, asking the LLM to ground its answer in that material. This can be useful for reducing factual errors and for answering questions that depend on information the model was not trained on.
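
As a rough sketch of where this section is heading, the final prompt in a retrieval-based setup combines the retrieved text with the question. The helper build_rag_prompt, the prompt wording, and the two sample passages are all assumptions for illustration; the activity's own prompt may look different.

```r
# Hypothetical helper: number the retrieved passages and place them before the question
build_rag_prompt <- function(question, passages) {
  context <- paste0("[", seq_along(passages), "] ", passages, collapse = "\n")
  paste(
    "Answer the question using only the reference passages below.",
    "Reference passages:",
    context,
    "Question:",
    question,
    sep = "\n"
  )
}

# Hand-written stand-ins for retrieved article text
passages <- c(
  "Twelve-tone equal temperament divides the octave into 12 equal semitones.",
  "An octave is the interval between two pitches where one has twice the frequency of the other."
)

rag_prompt <- build_rag_prompt(
  "In an equal-tempered musical scale, how many intervals are there in an octave?",
  passages
)
cat(rag_prompt)
```

A string built this way could be passed to completion or json_completion in place of the bare question.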

Add article text to the data frame

To perform RAG, we will ask the LLM to first generate search queries based on the questions.

# Step 1: Get search keywords
add_keywords <- function(question, ...) {
  # Ask the LLM for three Wikipedia search queries related to the question
   prompt <- paste(
      "
      Your task is to identify three diverse, detailed search queries to gather information from Wikipedia about the following question.
      The queries are going to be run through Wikipedia's internal search engine to retrieve entire relevant articles. 
      So make sure they are in the \"goldilocks\" zone of specificity.
      They should be specific enough to identify the best related article, but not more specific than that.
      Provide your response as a JSON object with the following attributes:
      - query1: A string containing the first search query
      - query2: A string containing the second search query
      - query3: A string containing the third search query
      ",
      question
   );
   query_response <- json_completion(prompt, max_tokens = 500);
   return (list(
     query1 = query_response$query1,
     query2 = query_response$query2,
     query3 = query_response$query3
   ));
}


# Add keywords to the dataset
rag_df <- 
  interesting_df %>% 
  map_table_rows(add_keywords);
rag_df %>% format_table()
id question answer query1 query2 query3
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… astable multivibrator circuits in music synthesizers synthesizer tuning stability and temperature effects effects of component aging on electronic circuits
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals. equal-tempered musical scale intervals in an octave octave definition in music theory music intervals and scales explanation
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… combinatorial mathematics groups of four combinations formula example binomial coefficient calculation
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ (3/2)^m = (1/2)^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… just intonation and pure fifths music theory mathematical proof stacking fifths octave tuning properties of just intonation and octave equivalence
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … 432Hz tuning benefits human well-being comparison of 432Hz and 440Hz musical tuning effects of musical tuning frequencies on health
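
The function above reads `query1` to `query3` straight from the parsed response, so a field the model omits would come back as `NULL`. How `map_table_rows()` treats `NULL` is not shown in this section, so as a precaution the fields could be normalized first. The sketch below assumes the parsed response is an ordinary R list, as the `$` access above suggests; `extract_queries()` is a hypothetical helper and not part of the original code.

```r
# Hypothetical helper: pull query1..queryN out of a parsed JSON response,
# substituting NA for any field that is missing, empty, or not a single string.
extract_queries <- function(response, n = 3) {
  # Anything that is not a list (NULL, an error string, ...) has no fields
  if (!is.list(response)) response <- list()
  fields <- paste0("query", seq_len(n))
  values <- lapply(fields, function(field) {
    value <- response[[field]]
    if (is.character(value) && length(value) == 1 && nzchar(trimws(value))) {
      trimws(value)
    } else {
      NA_character_
    }
  })
  setNames(values, fields)
}

# Example: a response in which the model left out query3
str(extract_queries(list(query1 = "Equal temperament", query2 = " Octave ")))
```

Inside `add_keywords()`, the final `return` could then become `return(extract_queries(query_response))`, which keeps the same three column names.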

Next, we use these search queries to fetch the article text from Wikipedia. After doing so, we ask the LLM to summarize the salient facts from the articles that are relevant to the question.
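
The next step relies on a `search_wikipedia()` helper that returns a list with `title` and `text` fields. If you need a stand-in, the sketch below shows one way such a helper might be written against the MediaWiki Action API. The names `parse_wikipedia_extract()` and `search_wikipedia_sketch()` are hypothetical, and the choice to return only the top search hit is an assumption. The helper actually used in this activity may behave differently.

```r
# Hypothetical: parse the JSON body of a MediaWiki `action=query` request made
# with `generator=search` and `prop=extracts`. Returns NA fields if no page matched.
parse_wikipedia_extract <- function(json_text) {
  parsed <- jsonlite::fromJSON(json_text, simplifyVector = FALSE)
  pages <- parsed$query$pages
  if (is.null(pages) || length(pages) == 0) {
    return(list(title = NA_character_, text = NA_character_))
  }
  page <- pages[[1]]  # only one page is requested, so take the first entry
  list(
    title = if (is.null(page$title)) NA_character_ else page$title,
    text  = if (is.null(page$extract)) NA_character_ else page$extract
  )
}

# Hypothetical stand-in for search_wikipedia(): fetch the plain text of the
# top search hit for `query` from English Wikipedia.
search_wikipedia_sketch <- function(query) {
  not_found <- list(title = NA_character_, text = NA_character_)
  if (is.null(query) || is.na(query) || !nzchar(query)) return(not_found)
  response <- httr::GET(
    "https://en.wikipedia.org/w/api.php",
    query = list(
      action = "query", format = "json",
      generator = "search", gsrsearch = query, gsrlimit = 1,
      prop = "extracts", explaintext = 1, exlimit = 1
    )
  )
  if (httr::http_error(response)) return(not_found)
  parse_wikipedia_extract(
    httr::content(response, as = "text", encoding = "UTF-8")
  )
}
```

Separating the parsing from the HTTP call lets the parsing logic be checked with a hand-written JSON string, without touching the network.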

# Step 2: Get article text for each query
add_articles <- function(question, query1, query2, query3, ...) {
  # Get the article text for each query
  article1 <- search_wikipedia(query1)
  article2 <- search_wikipedia(query2)
  article3 <- search_wikipedia(query3)
  
  # Use the LLM to extract key facts the LLM should focus on.
  facts <- completion(paste(
    "You are an expert synthesizer. Your job is to extract the most salient facts from three different articles related to a question.",
    "Write three to five sentences capturing the most important facts needed to answer the question.",
    toJSON(list(question = question, articles = list(article1$text, article2$text, article3$text)))
  ))
  
  # Return the results
  return (list(
     article_title1 = article1$title,
     article_title2 = article2$title,
     article_title3 = article3$title,
     article_text1 = article1$text,
     article_text2 = article2$text,
     article_text3 = article3$text,
     facts=facts
   ));
}


# Add article text and extracted facts to the dataset
rag_facts_df <- 
  rag_df %>% 
  map_table_rows(add_articles);
rag_facts_df %>% 
  format_table()
id question answer query1 query2 query3 article_title1 article_title2 article_title3 article_text1 article_text2 article_text3 facts
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… astable multivibrator circuits in music synthesizers synthesizer tuning stability and temperature effects effects of component aging on electronic circuits Feedback Crystal oscillator Digital electronics Feedback occurs when outputs of a system are routed back as inputs as part of a chain of cause-and-effect that forms a circuit or loop. The system can… A crystal oscillator is an electronic oscillator circuit that uses a piezoelectric crystal as a frequency-selective element. The oscillator frequency … Digital electronics is a field of electronics involving the study of digital signals and the engineering of devices that use or produce them. This is … A music synthesizer built with a chain of astable multivibrator circuits can detune after a few hours due to several factors, even when using fixed re…
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals. equal-tempered musical scale intervals in an octave octave definition in music theory music intervals and scales explanation Equal temperament Scale (music) Scale (music) An equal temperament is a musical temperament or tuning system that approximates just intervals by dividing an octave (or other interval) into steps s… In music theory, a scale is “any consecutive series of notes that form a progression between one note and its octave”, typically by order of pitch or … In music theory, a scale is “any consecutive series of notes that form a progression between one note and its octave”, typically by order of pitch or … In an equal-tempered musical scale, specifically the most common system known as 12-tone equal temperament (12 TET), there are 12 intervals in an octa…
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… combinatorial mathematics groups of four combinations formula example binomial coefficient calculation Discrete mathematics Combination Binomial coefficient Discrete mathematics is the study of mathematical structures that can be considered “discrete” (in a way analogous to discrete variables, having a bij… In mathematics, a combination is a selection of items from a set that has distinct members, such that the order of selection does not matter (unlike p… In mathematics, the binomial coefficients are the positive integers that occur as coefficients in the binomial theorem. Commonly, a binomial coefficie… To determine how many different groups of 4 can be formed from 142 people participating in a musical event called extreme quarteting, we can use the c…
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ (3/2)^m = (1/2)^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… just intonation and pure fifths music theory mathematical proof stacking fifths octave tuning properties of just intonation and octave equivalence Just intonation List of guitar tunings Semitone In music, just intonation or pure intonation is the tuning of musical intervals as whole number ratios (such as 3:2 or 4:3) of frequencies. An interva… This article contains a list of guitar tunings that supplements the article guitar tunings. In particular, this list contains more examples of open an… A semitone, also called a minor second, half step, or a half tone, is the smallest musical interval commonly used in Western tonal music, and it is co… In music theory, stacking a series of just intonation pure fifths (ratios of 3:2) will not yield a perfectly tuned octave (ratio of 2:1) or its multip…
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … 432Hz tuning benefits human well-being comparison of 432Hz and 440Hz musical tuning effects of musical tuning frequencies on health NA NA Psychoacoustics NA NA Psychoacoustics is the branch of psychophysics involving the scientific study of the perception of sound by the human auditory system. It is the branc… Tuning musical instruments to 432Hz is often claimed to provide benefits to human well-being and is considered by some to be more “natural” than the s…
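
Two things stand out in the table above: the last row has two searches that returned `NA`, and question 7311 retrieved "Scale (music)" twice. One option is to drop missing and duplicate articles before building the fact-extraction prompt, so the LLM only sees text that was actually retrieved. This is a sketch under that assumption; `keep_found_articles()` is a hypothetical helper and not part of the original code.

```r
# Hypothetical helper: keep only articles whose text was retrieved,
# and keep each title once.
keep_found_articles <- function(titles, texts) {
  found <- !is.na(texts) & nzchar(texts) & !duplicated(titles)
  lapply(which(found), function(i) {
    list(title = titles[[i]], text = texts[[i]])
  })
}

# Example using the titles returned for question 14377 above
articles <- keep_found_articles(
  titles = c(NA, NA, "Psychoacoustics"),
  texts  = c(NA, NA, "Psychoacoustics is the branch of psychophysics involving the scientific study of the perception of sound by the human auditory system.")
)
length(articles)
# → 1
```

In `add_articles()`, the result could be passed to `toJSON()` in place of the fixed three-element `articles` list.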

Finally, we use these facts to answer the questions.

add_rag_answer <- function(question, facts, ...) {
   prompt <- paste(
      "
      Your task is to provide a detailed response to the following question.
      You will be provided with facts from Wikipedia articles that may (or may not!) be relevant to the question.
      Read the question and the facts, and then generate a JSON object with the following fields:
      - plan: A single string containing a step-by-step plan for solving the problem.
      - details: A single string containing a detailed step-by-step solution to the problem.
      - answer: A single string containing the final answer to the question.
      
      Perform this task for the following question:
      ",
      toJSON(list(question = question, facts = facts))
   );
   rag_response <- json_completion(prompt, max_tokens = 2000);
   return (list(
     plan = rag_response$plan,
     details = rag_response$details,
     predicted = rag_response$answer
   ));
};

rag_answers <- 
  rag_facts_df %>%
  mutate_if(is.character, list(~na_if(.,""))) %>% # Convert empty strings to NA
  map_table_rows(add_rag_answer);

rag_answers %>% format_table()
id question answer query1 query2 query3 article_title1 article_title2 article_title3 article_text1 article_text2 article_text3 facts plan details predicted
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… astable multivibrator circuits in music synthesizers synthesizer tuning stability and temperature effects effects of component aging on electronic circuits Feedback Crystal oscillator Digital electronics Feedback occurs when outputs of a system are routed back as inputs as part of a chain of cause-and-effect that forms a circuit or loop. The system can… A crystal oscillator is an electronic oscillator circuit that uses a piezoelectric crystal as a frequency-selective element. The oscillator frequency … Digital electronics is a field of electronics involving the study of digital signals and the engineering of devices that use or produce them. This is … A music synthesizer built with a chain of astable multivibrator circuits can detune after a few hours due to several factors, even when using fixed re…
  1. Identify the components of the music synthesizer circuit. 2. Analyze how temperature changes affect component values. 3. Investigate the aging proc…
To understand why a music synthesizer built with a chain of astable multivibrator circuits detunes after a few hours, we first need to identify the co… A music synthesizer built with a chain of astable multivibrator circuits detunes after a few hours due to component instability from temperature chang…
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals. equal-tempered musical scale intervals in an octave octave definition in music theory music intervals and scales explanation Equal temperament Scale (music) Scale (music) An equal temperament is a musical temperament or tuning system that approximates just intervals by dividing an octave (or other interval) into steps s… In music theory, a scale is “any consecutive series of notes that form a progression between one note and its octave”, typically by order of pitch or … In music theory, a scale is “any consecutive series of notes that form a progression between one note and its octave”, typically by order of pitch or … In an equal-tempered musical scale, specifically the most common system known as 12-tone equal temperament (12 TET), there are 12 intervals in an octa…
  1. Identify the type of musical scale mentioned in the question. 2. Refer to the provided facts to find information about the intervals in an octave f…
The question asks about the number of intervals in an octave within an equal-tempered musical scale. The relevant fact states that in the 12-tone equa… There are 12 intervals in an octave in an equal-tempered musical scale.
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… combinatorial mathematics groups of four combinations formula example binomial coefficient calculation Discrete mathematics Combination Binomial coefficient Discrete mathematics is the study of mathematical structures that can be considered “discrete” (in a way analogous to discrete variables, having a bij… In mathematics, a combination is a selection of items from a set that has distinct members, such that the order of selection does not matter (unlike p… In mathematics, the binomial coefficients are the positive integers that occur as coefficients in the binomial theorem. Commonly, a binomial coefficie… To determine how many different groups of 4 can be formed from 142 people participating in a musical event called extreme quarteting, we can use the c…
  1. Identify the total number of participants (n = 142) and the group size (k = 4). 2. Use the formula for combinations to calculate the number of ways…
To find the number of different groups of 4 that can be formed from 142 people, we will use the combinations formula C(n, k) = n! / (k! * (n - k)!), w… 16515035
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ (3/2)^m = (1/2)^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… just intonation and pure fifths music theory mathematical proof stacking fifths octave tuning properties of just intonation and octave equivalence Just intonation List of guitar tunings Semitone In music, just intonation or pure intonation is the tuning of musical intervals as whole number ratios (such as 3:2 or 4:3) of frequencies. An interva… This article contains a list of guitar tunings that supplements the article guitar tunings. In particular, this list contains more examples of open an… A semitone, also called a minor second, half step, or a half tone, is the smallest musical interval commonly used in Western tonal music, and it is co… In music theory, stacking a series of just intonation pure fifths (ratios of 3:2) will not yield a perfectly tuned octave (ratio of 2:1) or its multip…
  1. Define the ratios for just intonation pure fifths and octaves. 2. Set up the equation (3/2)^m = (1/2)^n. 3. Rewrite the equation in terms of a comm…
  1. In music theory, the ratio for a just intonation pure fifth is 3:2, and the ratio for an octave is 2:1. 2. We start with the equation (3/2)^m = (1/…
Stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples, as the equation (3/2)^m = (1/2)^n has…
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … 432Hz tuning benefits human well-being comparison of 432Hz and 440Hz musical tuning effects of musical tuning frequencies on health NA NA Psychoacoustics NA NA Psychoacoustics is the branch of psychophysics involving the scientific study of the perception of sound by the human auditory system. It is the branc… Tuning musical instruments to 432Hz is often claimed to provide benefits to human well-being and is considered by some to be more “natural” than the s…
  1. Analyze the claims about 432Hz tuning and its benefits. 2. Review the scientific evidence regarding the effects of different tuning frequencies on …
First, we need to examine the claims surrounding 432Hz tuning, which is often said to be more harmonious with nature and beneficial for human well-bei… There is no significant scientific evidence that tuning musical instruments to 432Hz provides benefits to human well-being compared to the standard 44…
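
Question 7769 has a closed-form answer, so the value in the `predicted` column can be checked directly in R without involving an LLM. The snippet below is an illustrative check using base R's `choose()`; the predicted value is copied from the table above.

```r
# Number of unordered groups of 4 drawn from 142 people: C(142, 4)
n_groups <- choose(142, 4)
format(n_groups, big.mark = ",")
# → "16,234,505"

# Value the model returned in the `predicted` column for question 7769
predicted <- 16515035
predicted == n_groups
# → FALSE
```

The plan and the formula in the model's response are sound, but the final arithmetic is not, which is a useful reminder to compute numeric answers in code where possible.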

We then grade the RAG answers using the LLM-as-judge approach.

rag_answers %>% 
  map_table_rows(grade_predicted_answer) %>%
  format_table()
id question answer query1 query2 query3 article_title1 article_title2 article_title3 article_text1 article_text2 article_text3 facts plan details predicted score feedback
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… astable multivibrator circuits in music synthesizers synthesizer tuning stability and temperature effects effects of component aging on electronic circuits Feedback Crystal oscillator Digital electronics Feedback occurs when outputs of a system are routed back as inputs as part of a chain of cause-and-effect that forms a circuit or loop. The system can… A crystal oscillator is an electronic oscillator circuit that uses a piezoelectric crystal as a frequency-selective element. The oscillator frequency … Digital electronics is a field of electronics involving the study of digital signals and the engineering of devices that use or produce them. This is … A music synthesizer built with a chain of astable multivibrator circuits can detune after a few hours due to several factors, even when using fixed re…
  1. Identify the components of the music synthesizer circuit. 2. Analyze how temperature changes affect component values. 3. Investigate the aging proc…
To understand why a music synthesizer built with a chain of astable multivibrator circuits detunes after a few hours, we first need to identify the co… A music synthesizer built with a chain of astable multivibrator circuits detunes after a few hours due to component instability from temperature chang… 8 The student’s answer provides a comprehensive explanation of the factors contributing to detuning in a music synthesizer, including power supply fluct…
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals. equal-tempered musical scale intervals in an octave octave definition in music theory music intervals and scales explanation Equal temperament Scale (music) Scale (music) An equal temperament is a musical temperament or tuning system that approximates just intervals by dividing an octave (or other interval) into steps s… In music theory, a scale is “any consecutive series of notes that form a progression between one note and its octave”, typically by order of pitch or … In music theory, a scale is “any consecutive series of notes that form a progression between one note and its octave”, typically by order of pitch or … In an equal-tempered musical scale, specifically the most common system known as 12-tone equal temperament (12 TET), there are 12 intervals in an octa…
  1. Identify the type of musical scale mentioned in the question. 2. Refer to the provided facts to find information about the intervals in an octave f…
The question asks about the number of intervals in an octave within an equal-tempered musical scale. The relevant fact states that in the 12-tone equa… There are 12 intervals in an octave in an equal-tempered musical scale. 10 The student’s answer is correct and accurately states that an octave in an equal-tempered musical scale is divided into 12 intervals. The phrasing is …
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… combinatorial mathematics groups of four combinations formula example binomial coefficient calculation Discrete mathematics Combination Binomial coefficient Discrete mathematics is the study of mathematical structures that can be considered “discrete” (in a way analogous to discrete variables, having a bij… In mathematics, a combination is a selection of items from a set that has distinct members, such that the order of selection does not matter (unlike p… In mathematics, the binomial coefficients are the positive integers that occur as coefficients in the binomial theorem. Commonly, a binomial coefficie… To determine how many different groups of 4 can be formed from 142 people participating in a musical event called extreme quarteting, we can use the c…
  1. Identify the total number of participants (n = 142) and the group size (k = 4). 2. Use the formula for combinations to calculate the number of ways…
To find the number of different groups of 4 that can be formed from 142 people, we will use the combinations formula C(n, k) = n! / (k! * (n - k)!), w… 16515035 7 The student correctly explains the combination formula and applies it to the problem, but the final calculation of the number of groups is incorrect. …
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ (3/2)^m = (1/2)^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… just intonation and pure fifths music theory mathematical proof stacking fifths octave tuning properties of just intonation and octave equivalence Just intonation List of guitar tunings Semitone In music, just intonation or pure intonation is the tuning of musical intervals as whole number ratios (such as 3:2 or 4:3) of frequencies. An interva… This article contains a list of guitar tunings that supplements the article guitar tunings. In particular, this list contains more examples of open an… A semitone, also called a minor second, half step, or a half tone, is the smallest musical interval commonly used in Western tonal music, and it is co… In music theory, stacking a series of just intonation pure fifths (ratios of 3:2) will not yield a perfectly tuned octave (ratio of 2:1) or its multip…
  1. Define the ratios for just intonation pure fifths and octaves. 2. Set up the equation (3/2)^m = (1/2)^n. 3. Rewrite the equation in terms of a comm…
  1. In music theory, the ratio for a just intonation pure fifth is 3:2, and the ratio for an octave is 2:1. 2. We start with the equation (3/2)^m = (1/…
Stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples, as the equation (3/2)^m = (1/2)^n has… 8 The student’s answer correctly identifies the equation and provides a valid mathematical argument to show that there are no solutions for positive int…
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … 432Hz tuning benefits human well-being comparison of 432Hz and 440Hz musical tuning effects of musical tuning frequencies on health NA NA Psychoacoustics NA NA Psychoacoustics is the branch of psychophysics involving the scientific study of the perception of sound by the human auditory system. It is the branc… Tuning musical instruments to 432Hz is often claimed to provide benefits to human well-being and is considered by some to be more “natural” than the s…
  1. Analyze the claims about 432Hz tuning and its benefits. 2. Review the scientific evidence regarding the effects of different tuning frequencies on …
First, we need to examine the claims surrounding 432Hz tuning, which is often said to be more harmonious with nature and beneficial for human well-bei… There is no significant scientific evidence that tuning musical instruments to 432Hz provides benefits to human well-being compared to the standard 44… 8 The student’s answer accurately addresses the question by highlighting the lack of scientific evidence supporting the benefits of 432Hz tuning over 44…
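
With only five rows the grades are easy to read by eye, but on a larger sample a numeric summary is more practical. The sketch below copies the five `score` values from the table above into a small data frame (the name `rag_scores` is made up for this illustration) and reports the mean and the lowest-scoring question.

```r
# Scores copied from the `score` column of the table above
rag_scores <- data.frame(
  id    = c(2568, 7311, 7769, 8248, 14377),
  score = c(8, 10, 7, 8, 8)
)

# Average grade across the five RAG answers
mean(rag_scores$score)
# → 8.2

# Question(s) with the lowest grade
rag_scores$id[rag_scores$score == min(rag_scores$score)]
# → 7769
```

Five questions is far too small a sample to draw conclusions about whether RAG helps; the same summary applied to a larger set of graded answers would be more informative.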

Conclusion

Through this activity, you’ve learned how to:

Understanding how to use LLMs for grading can help in assessing model performance and automating evaluation tasks.

Further Exploration

References