Introduction

Large Language Models (LLMs) like GPT-4 have revolutionized how we communicate and understand information. In this activity, we’ll explore how to leverage LLMs in R for Data Science using the openai package.

We’ll start by experimenting with the OpenAI API for free-text and structured responses. Then we’ll use LLMs to answer, and grade, textual questions.

Objectives:

Prerequisites

Setup

Download the Reasoning 20k dataset and place the combined_reasoning.json file somewhere you can find it.

# Install and load required packages
# install.packages("openai");
# install.packages("httr");
# install.packages("jsonlite");
# install.packages("tidyverse");
# install.packages("furrr");
# install.packages("kableExtra")

library(openai);
library(httr);
## 
## Attaching package: 'httr'
## The following object is masked from 'package:openai':
## 
##     upload_file
library(jsonlite);
library(tidyverse);
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ purrr::flatten()    masks jsonlite::flatten()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ httr::upload_file() masks openai::upload_file()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(furrr); # for parallel map
## Loading required package: future
library(kableExtra);
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

Utility function to format tables nicely:

# Function to show the first 5 rows of a data frame as a styled table,
# truncating long text in all string (character) columns
format_table <- function(df, max_length = 150) {
  head_df <- data.frame(df %>% head(5))
  # Function to truncate individual text entries
  truncate_text <- function(text, max_length) {
    text <- gsub("[\r\n]", "", text)
    return (
      ifelse(nchar(text) > max_length, 
             paste0(substr(text, 1, max_length), "..."), 
             text)
    )
  }
  
  # Loop over all columns that are character type and apply truncation
  for (col in colnames(head_df)) {
    if (is.character(head_df[[col]])) {
      head_df[[col]] <- sapply(head_df[[col]], truncate_text, max_length = max_length)
    }
  }
  
  # Return the modified data frame
  head_df %>%
    kbl() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size=12)
}
# Set your OpenAI API key. Shilad or your instructor will give this to you.
Sys.setenv(OPENAI_API_KEY = "YOUR API KEY")

1. Experimenting with the OpenAI API

Our interaction with the LLM will be through the OpenAI Chat Completions API. This API lets us send prompts to the LLM and receive completions in return. As of 2024, it is one of the most widely used ways to interact with LLMs programmatically. In this activity we will use it to ask questions, generate text, and even grade responses.

Simple text completion

To begin, let’s start with a simple question-answering task using the GPT-4o-mini model. We will encapsulate the logic in a function called completion, which sends a prompt to the LLM and returns its text reply.

completion <- function(prompt, max_tokens = 100) {
  # Get the response from the LLM
  response <- openai::create_chat_completion(
    model = "gpt-4o-mini",
    messages = list(list(role = "system", content = prompt)),
    temperature = 0.1,
    max_tokens = max_tokens
  )
  
  # Return the response
  return (response$choices$message.content);
}
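The messages argument above is a list of role/content pairs. As a minimal offline sketch (no API call is made), the code below builds such a list and prints the JSON array it corresponds to. The build_messages helper and the split into a system instruction plus a user question are illustrative assumptions; the completion function above sends the whole prompt as a single system message.

```r
library(jsonlite)

# Hypothetical helper: pair an optional system instruction with a user question.
build_messages <- function(question, system_prompt = NULL) {
  messages <- list()
  if (!is.null(system_prompt)) {
    messages <- c(messages, list(list(role = "system", content = system_prompt)))
  }
  c(messages, list(list(role = "user", content = question)))
}

msgs <- build_messages("Why is the sky blue?", "Answer in one sentence.")

# Inspect the JSON structure this list serializes to.
toJSON(msgs, auto_unbox = TRUE, pretty = TRUE)
```

A list shaped like msgs could be passed as the messages argument in place of the single-message list used above.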

Task 1: Play with simple text completion and reflect on results

Experiment with the function by asking a simple question as shown in the example below. Change the question below to several that you are interested in or have expertise related to. How does it perform? What does it do well? What does it do poorly? Put your example questions and analysis in the code below.

# Ask a few questions. In comments answer: what does it do well? What does it do poorly?
completion("Why is the sky blue?", 200);
## [1] "The sky appears blue primarily due to a phenomenon called Rayleigh scattering. When sunlight enters the Earth's atmosphere, it is made up of different colors, each with varying wavelengths. Blue light has a shorter wavelength compared to other colors like red or yellow.\n\nAs sunlight passes through the atmosphere, it interacts with air molecules and small particles. Because blue light is scattered in all directions more effectively than other colors due to its shorter wavelength, we perceive the sky as blue during the day.\n\nAt sunrise and sunset, the sun is lower on the horizon, and its light has to pass through a greater thickness of the atmosphere. This increased distance scatters the shorter blue wavelengths out of our line of sight, allowing the longer wavelengths, such as red and orange, to dominate the sky's appearance during those times."

Structured text generation

For programmatic use, it’s often helpful to have the LLM return structured responses from which we can extract several different pieces of information.

json_completion <- function(prompt, max_tokens = 200) {
  # Get the response from the LLM
  response <- openai::create_chat_completion(
    model = "gpt-4o-mini",
    messages = list(list(role = "system", content = prompt)),
    temperature = 0.0,
    max_tokens = max_tokens
  )
  
  json <- response$choices$message.content;
  
  # Shilad: This is a hack to strip the ```json ... ``` fence that 4o-mini occasionally wraps around its JSON.
  # Ideally we would use response_format = {type: json_object} to avoid this, but it's not supported by the R OpenAI wrapper.
  pattern <- regex("```json(.*?)```", dotall = TRUE);
  if (str_detect(json, pattern)) {
    json <- str_match(json, pattern)[, 2]; # Extract the matched JSON content
  }
  
  return (fromJSON(json));
}
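The fence-stripping step can be tried without calling the API. The sketch below applies the same regular expression to two made-up replies, one wrapped in a Markdown fence and one bare, and checks that both parse to the same list. The extract_json name and the sample replies are assumptions for illustration; the fence is built with strrep only to avoid typing three literal backticks.

```r
library(stringr)
library(jsonlite)

fence <- strrep("`", 3)  # three backticks

# Hypothetical helper mirroring the fence-stripping step in json_completion.
extract_json <- function(raw) {
  pattern <- regex(paste0(fence, "json(.*?)", fence), dotall = TRUE)
  if (str_detect(raw, pattern)) {
    raw <- str_match(raw, pattern)[, 2]  # keep only the text inside the fence
  }
  fromJSON(str_trim(raw))
}

# Made-up replies: the same JSON with and without a surrounding fence.
bare   <- "{\"score\": 8, \"feedback\": \"Mostly correct.\"}"
fenced <- paste0(fence, "json\n", bare, "\n", fence)

identical(extract_json(fenced), extract_json(bare))
```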

Below you can find an example of calling this structured completion. The return value will be a list with fields such as response$attendees.

response <- json_completion("
Your task is to extract a structured calendar invite by analyzing a short text.

The return value should be a JSON object with the following fields:
- attendees: A list of strings with attendee names.
- when: The starting date and time for the calendar event
- subject: Short description of the event
- description: Detailed few-sentence description of the event.

Perform this task for the following text:

Matthew and Ellen should meet Sunday at 4pm to discuss the future of the budget.
")
response
## $attendees
## [1] "Matthew" "Ellen"  
## 
## $when
## [1] "2023-10-29T16:00:00"
## 
## $subject
## [1] "Budget Discussion"
## 
## $description
## [1] "Matthew and Ellen will meet to discuss the future of the budget. This meeting will focus on planning and strategizing for upcoming financial decisions."
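Every field in the parsed reply is still plain text, so values such as when usually need converting before analysis. A minimal sketch, assuming the timestamp format shown in the output above and treating it as UTC because the reply carries no time zone:

```r
# The `when` field arrives as an ISO 8601 string, not a date-time object.
when_text <- "2023-10-29T16:00:00"

# Assumption: interpret the timestamp as UTC.
when <- as.POSIXct(when_text, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# wday counts days since Sunday, so 0 means the model did pick a Sunday.
as.POSIXlt(when)$wday
```

Checks like this are one way to catch replies that look plausible but do not match the source text.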

Task 2: Play with structured text generation and reflect on results

# Create your own example task (not a calendar invite) that produces structured output
# using the example above. Push the LLM with a hard example. Does it get it correct?

response <- json_completion("
Your task is to ....
")
response
# In your comments, reflect on: How might you use this for Data Science purposes?

2. Loading and exploring the Q&A dataset

In this assignment we will answer questions from the Reasoning 20k dataset. The dataset contains a set of challenging factual questions along with their answers. We will load the dataset, filter example questions, and then interact with the LLM to answer the questions. We chose this dataset because it was created in October 2024, so the LLM could not possibly “cheat” by having trained on this data.

Load the Reasoning 20k Dataset

Download the JSON dataset to your computer and read it into a variable named reasoning_20k_df using code similar to the following:

reasoning_20k_df <- 
  as.data.frame(fromJSON("~/Downloads/combined_reasoning.json")) %>%
  select(user, assistant) %>%
  rename(question = user, answer = assistant) %>%
  mutate(id = row_number()) %>%
  relocate(id, .before = question);

format_table(reasoning_20k_df)
| id | question | answer |
|---|---|---|
| 1 | Prove that the difference between two consecutive cubes cannot be divisible by 5, using the fact that the only possible remainders when a cube is divi… | Let the two consecutive cubes be \(n^3\) and \((n+1)^3\). Their difference is:\[(n+1)^3 - n^3 = 3n^2 + 3n + 1.\]When \(n^3\) is divided by 5, the possible r… |
| 2 | How can I integrate the function \(\arcsin(\sqrt{x+1}-\sqrt{x})\)? Is there an easier way than using the formula $f(x),dx=x f(x)-_{… | |
| 3 | Given the expression \(\frac{x^3+x+1}{(x^2+1)^2}\), decompose it into partial fractions. | The decomposition of \(\frac{x^3+x+1}{(x^2+1)^2}\) can be directly observed as \(\dfrac{x}{x^2+1}+\dfrac{1}{(x^2+1)^2}\). This is because \(x^3+x\) can be f… |
| 4 | Is it true that a man named Mûrasi from India is 179 years old, as claimed by certain sources?Sources:- eface India- News Origin- World News Daily Rep… | No, this claim is not accurate. The source of this information, the World News Daily Report, is known to publish fake news. They claim that Mûrasi was… |
| 5 | Find an example of a linear operator whose norm is not equal to the norm of its inverse. | Consider the linear operator T from \((\mathbb{R}^2, \|\cdot\|_{sup})\) to \((\mathbb{R}^2, \|\cdot\|_1)\) defined by \(T(x,y) = (y,x)\). The norm of T is 1… |
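To see what each step of the loading pipeline does, it can help to run it on a tiny inline example first. The sketch below uses two made-up records; it assumes the file is a JSON array of objects with user and assistant fields, which is what the select() call above implies.

```r
library(jsonlite)
library(dplyr)

# Made-up records standing in for combined_reasoning.json.
toy_json <- '[
  {"user": "What is 2 + 2?", "assistant": "4", "source": "demo"},
  {"user": "Name a primary color.", "assistant": "Red", "source": "demo"}
]'

toy_df <-
  as.data.frame(fromJSON(toy_json)) %>%
  select(user, assistant) %>%                    # drop every other field
  rename(question = user, answer = assistant) %>%
  mutate(id = row_number()) %>%                  # ids follow file order
  relocate(id, .before = question)

toy_df
```

Because id is just the row number, the ids you write down later are only stable as long as the file is loaded in the same order.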

Task 1: Pick interesting questions

Now that we have the dataset loaded, pick some interesting questions to ask the LLM. Open the dataset in the built-in R dataset viewer and use its search field to find 5-10 questions that interest you. Write down their ids and create a dataframe called interesting_df that contains just those questions.

# These are questions related to music theory that interest Shilad.
# Pick ones that interest you. Locate them using the search function built into the RStudio dataset viewer.
question_ids <- c(8248, 14377, 7769, 7311, 2568);
interesting_df <- reasoning_20k_df %>% filter(id %in% question_ids);
format_table(interesting_df)
| id | question | answer |
|---|---|---|
| 2568 | Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… | The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… |
| 7311 | In an equal-tempered musical scale, how many intervals are there in an octave? | An octave in an equal-tempered musical scale is divided into 12 intervals. |
| 7769 | In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? | To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… |
| 8248 | In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… | To prove this, we start with the given equation: $ \left(\frac{3}{2}\right)^m = \left(\frac{1}{2}\right)^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… |
| 14377 | Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… | There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … |

Carefully read through the questions and answers, and make sure they are accurate and make sense!

3. The row-mapping helper function

Throughout this activity we are going to use a helper function called map_table_rows that applies a function to each row of a dataframe and returns a new dataframe with the original columns and the new columns. This function is useful for applying the LLM to each row of a dataset.

To speed up the process, we are going to use the future_pmap function from the furrr package, which is a parallelized version of purrr's pmap.

map_table_rows <- function(df, mapping_function) {
  result <- future_pmap(df, mapping_function);
  return (cbind(df, bind_rows(result)));
}
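The row-by-row mechanics are easier to follow in a sequential version. The sketch below swaps furrr::future_pmap for purrr::pmap (same calling convention, no parallelism); map_table_rows_seq, words, and repeat_word are illustrative names, not part of the activity.

```r
library(purrr)
library(dplyr)

# Sequential stand-in for map_table_rows.
map_table_rows_seq <- function(df, mapping_function) {
  # pmap calls mapping_function once per row, passing each column as a named argument.
  result <- pmap(df, mapping_function)
  # Each call returns a named list; bind_rows stacks them into new columns.
  cbind(df, bind_rows(result))
}

words <- data.frame(
  word = c("tidy", "data"),
  times = c(2, 3),
  stringsAsFactors = FALSE
)

# The mapper names only the columns it needs; `...` absorbs any others.
repeat_word <- function(word, times, ...) {
  list(repeated = strrep(word, times), n_chars = nchar(word) * times)
}

out <- map_table_rows_seq(words, repeat_word)
out
```

The `...` in the mapper matters: pmap passes every column, so a mapper without it would fail on data frames that have extra columns.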

Task 3: Understand the map_table_rows example

Below is an example of how to use the map_table_rows function. Take a look at the code and add a comment explaining what it is doing.

Add a second example that adds the square of x + y to the dataframe.

# Add a comment below indicating exactly what is happening
df <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(5, 6, 7, 8),
  z = c(9, 10, 11, 12)
);

example_mapper <- function(x, y, ...) {
  return (list(
    sum = x + y,
    product = x * y
  ))
}

df %>% map_table_rows(example_mapper)
##   x y  z sum product
## 1 1 5  9   6       5
## 2 2 6 10   8      12
## 3 3 7 11  10      21
## 4 4 8 12  12      32
# Add a second example that adds the square of x + y to the dataframe.

To speed up this mapping, we will ask furrr to work on up to 10 rows in parallel. This means we will execute up to 10 LLM calls at the same time.

plan(multisession, workers = 10)
## Warning in checkNumberOfLocalWorkers(workers): Careful, you are setting up 10
## localhost parallel workers with only 8 CPU cores available for this R process
## (per 'system'), which could result in a 125% load. The soft limit is set to
## 100%. Overusing the CPUs has negative impact on the current R process, but also
## on all other processes of yours and others running on the same machine. See
## help("parallelly.options", package = "parallelly") for how to override the soft
## and hard limits

When you run this, depending on how many cores your laptop has, you may see a warning message indicating that this setting may saturate your CPU. Why is this unlikely when we use the parallel functions to interact with the OpenAI API?

4. Create a dataframe with predicted answers

Task 4: Create a dataframe with predicted answers

Create a new dataset called simple_answers that takes interesting_df and uses the map_table_rows helper we just created to add a predicted column with the generated answer.

# Complete your implementation of the function below. 
# It should be a one-liner that calls the `completion` function.
# The name of the new column MUST be `predicted`
add_simple_answer <- function(question, ...) {
  return (list(predicted = completion(question)));
}

simple_answers <- 
  interesting_df %>%
  map_table_rows(add_simple_answer);

format_table(simple_answers)
| id | question | answer | predicted |
|---|---|---|---|
| 2568 | Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… | The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… | A music synthesizer built with a chain of astable multivibrator circuits can experience detuning over time for several reasons, even when using fixed … |
| 7311 | In an equal-tempered musical scale, how many intervals are there in an octave? | An octave in an equal-tempered musical scale is divided into 12 intervals. | In an equal-tempered musical scale, there are 12 intervals in an octave. These intervals are typically referred to as semitones or half steps. Each se… |
| 7769 | In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? | To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… | To find the number of different groups of 4 that can be formed from 142 people, we can use the combination formula, which is given by:[C(n, r) = … |
| 8248 | In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… | To prove this, we start with the given equation: $ \left(\frac{3}{2}\right)^m = \left(\frac{1}{2}\right)^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… | To prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples, we need to analyze the … |
| 14377 | Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… | There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … | The debate over tuning musical instruments to 432Hz versus the standard 440Hz is a topic of interest among musicians, sound therapists, and some alter… |

Take a look at the predicted answers and compare them to the actual answers. What do you notice?

5. Evaluate predicted answers

To evaluate the quality of responses, we would traditionally use human experts to grade them. LLMs offer a new approach to labeling datasets: the LLM-as-Judge paradigm uses an LLM to evaluate responses, providing feedback and a score based on correctness and completeness. This allows us to assess the performance of the LLM or other models over a set of questions.

Think to yourself: What are the costs and benefits of using LLM-as-judge vs humans? When may it make sense to use one vs. the other?

Below is code that takes a dataframe and returns the grades for each question:

grade_predicted_answer <- function(question, answer, predicted, ...) {
   prompt <- paste(
        "
        You are an expert grader.
        Evaluate the student's answer to the following question.
        Provide your response as a JSON object with the following attributes:
        - feedback: A brief summary of feedback on the correctness of the student's answer.
        - score: A score out of 10 based on the quality of the student's response.
        
        Perform this task for the following question:
        ",
        toJSON(
          list(question = question, answer = answer, predicted = predicted),
          auto_unbox = TRUE
        )
   );
   grading_response <- json_completion(prompt);
   return (list(
     score = grading_response$score,
     feedback = grading_response$feedback
   ));
};
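The grading prompt embeds the question, reference answer, and predicted answer as a JSON object rather than pasting them in raw. A small offline sketch with a made-up row shows one reason this helps: quotes and newlines inside an answer are escaped on the way in and recovered exactly on the way out.

```r
library(jsonlite)

# Made-up predicted answer containing characters that would be awkward to paste raw.
tricky <- "Answer:\n\"12\" semitones"

payload <- toJSON(
  list(
    question  = "How many semitones are in an octave?",
    answer    = "12",
    predicted = tricky
  ),
  auto_unbox = TRUE
)

payload

# Round-tripping recovers the original text unchanged.
identical(fromJSON(payload)$predicted, tricky)
```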

Task 5: Grade the simple answers

Use the function above along with map_table_rows to assign grades to each predicted answer. In comments, note whether the scores and feedback make sense to you.

# Use map_table_rows and grade_predicted_answer to assign grades to simple_answers
simple_answers %>% 
  map_table_rows(grade_predicted_answer) %>%
  format_table()
| id | question | answer | predicted | score | feedback |
|---|---|---|---|---|---|
| 2568 | Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… | The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… | A music synthesizer built with a chain of astable multivibrator circuits can experience detuning over time for several reasons, even when using fixed … | 7 | The student’s answer correctly identifies several factors that can lead to detuning in a music synthesizer built with astable multivibrators, such as … |
| 7311 | In an equal-tempered musical scale, how many intervals are there in an octave? | An octave in an equal-tempered musical scale is divided into 12 intervals. | In an equal-tempered musical scale, there are 12 intervals in an octave. These intervals are typically referred to as semitones or half steps. Each se… | 8 | The student’s answer is correct in stating that there are 12 intervals in an octave in an equal-tempered scale. However, it could be improved by menti… |
| 7769 | In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? | To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… | To find the number of different groups of 4 that can be formed from 142 people, we can use the combination formula, which is given by:[C(n, r) = … | 7 | The student correctly identifies the combination formula and applies it to calculate the number of groups of 4 from 142 people. However, the answer is… |
| 8248 | In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… | To prove this, we start with the given equation: $ \left(\frac{3}{2}\right)^m = \left(\frac{1}{2}\right)^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… | To prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples, we need to analyze the … | 8 | The student’s answer correctly identifies the mathematical relationship between the powers of 3 and 2, and effectively uses the Fundamental Theorem of… |
| 14377 | Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… | There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … | The debate over tuning musical instruments to 432Hz versus the standard 440Hz is a topic of interest among musicians, sound therapists, and some alter… | 8 | The student’s answer accurately addresses the question by highlighting the lack of scientific evidence supporting the benefits of 432Hz tuning over 44… |
# Look at the scores and feedback. Does it make sense to you?

6. Create Chain of Thought answers

Chain of Thought is a prompting technique where you ask the LLM to work through the problem in structured steps. This can be useful for generating more detailed answers or exploring a topic in depth.

Task 6: Implementing Chain of Thought Prompts

Apply the pattern you see in the grading function to create a new data frame called cot_answers (for “Chain of Thought”).

* Use the json_completion function to generate a more detailed response to each question. You will need to raise the second argument from the default number of tokens to something higher (e.g. 1000).
* You should ask the LLM to produce a JSON object with the following fields:
  - plan: A step-by-step plan for solving the problem.
  - details: A detailed step-by-step solution to the problem.
  - answer: The final answer to the question.
* Note that the order of the fields in the JSON object is important. You must force the LLM to generate output in the order it “thinks.”

add_cot_answer <- function(question, ...) {
   prompt <- paste(
      "
      Your task is to provide a detailed response to the following question.
      Generate a JSON object with the following fields:
      - plan: A single string containing step-by-step plan for solving the problem.
      - details: A single string containing detailed step-by-step solution to the problem.
      - answer: The single string containing final answer to the question.
      
      Perform this task for the following question:
      ",
      question
   );
   cot_response <- json_completion(prompt, max_tokens = 1000);
   return (list(
     plan = cot_response$plan,
     details = cot_response$details,
     predicted = cot_response$answer
   ));
};

cot_answers <- 
  interesting_df %>%
  map_table_rows(add_cot_answer);

format_table(cot_answers)
| id | question | answer | plan | details | predicted |
|---|---|---|---|---|---|
| 2568 | Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… | The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… | 1. Understand the function of astable multivibrators in synthesizers. 2. Investigate the factors that can cause detuning in electronic circuits. 3. An… | 1. Astable multivibrators are circuits that generate a continuous square wave output, commonly used in synthesizers to create audio signals. They rely… | A music synthesizer built with a chain of astable multivibrator circuits detunes after a few hours due to temperature variations affecting resistor an… |
| 7311 | In an equal-tempered musical scale, how many intervals are there in an octave? | An octave in an equal-tempered musical scale is divided into 12 intervals. | 1. Understand the concept of an octave in music theory. 2. Define what an interval is in the context of a musical scale. 3. Identify the number of sem… | An octave in music theory is the interval between one musical pitch and another with double its frequency. In an equal-tempered scale, an octave is di… | There are 12 intervals in an octave in an equal-tempered musical scale. |
| 7769 | In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? | To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… | 1. Identify the total number of participants (n = 142). 2. Use the combination formula to calculate the number of ways to choose 4 people from 142. 3…. | 1. We have 142 participants, so n = 142. 2. We want to form groups of 4, so r = 4. 3. The combination formula is C(142, 4) = 142! / (4! * (142 - 4)!)…. | 16242880 |
| 8248 | In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… | To prove this, we start with the given equation: $ \left(\frac{3}{2}\right)^m = \left(\frac{1}{2}\right)^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… | 1. Understand the concept of just intonation and pure fifths. 2. Define the mathematical equation to be proven. 3. Analyze the left-hand side and righ… | 1. Just intonation is a system of tuning based on ratios of whole numbers, where a pure fifth is represented by the ratio 3:2. Stacking pure fifths me… | There are no positive integer solutions for the equation (3/2)^m = (1/2)^n, proving that stacking just intonation pure fifths will never result in a p… |
| 14377 | Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… | There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … | 1. Research the history and context of 432Hz and 440Hz tuning. 2. Investigate scientific studies on the effects of different tuning frequencies on hum… | 1. The history of musical tuning reveals that 440Hz became the standard tuning frequency in the mid-20th century, while 432Hz has been associated with… | There is no significant scientific evidence to support that tuning musical instruments to 432Hz provides benefits to human well-being compared to the … |
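Numeric answers like the one for question 7769 can be checked directly in R instead of being taken on trust. The question asks for the number of unordered groups of 4 drawn from 142 people, which is the binomial coefficient C(142, 4). A minimal check using base R:

```r
# Number of ways to choose 4 people from 142, order ignored.
n_groups <- choose(142, 4)

# The same value written out from 142! / (4! * 138!) after cancelling 138!.
by_hand <- (142 * 141 * 140 * 139) / (4 * 3 * 2 * 1)

n_groups
by_hand == n_groups

# Compare against the value the model reported for this question.
n_groups == 16242880
```

Both computations give 16,234,505, so the 16242880 in the predicted column is close but not correct, even though the plan and formula in the other columns are right.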

Task 7: Grade Chain of Thought Prompts

Finally, use the same procedure you did earlier to grade the new responses. Do you see any interesting differences?

cot_answers %>% 
  map_table_rows(grade_predicted_answer) %>%
  format_table()
| id | question | answer | plan | details | predicted | score | feedback |
|---|---|---|---|---|---|---|---|
| 2568 | Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… | The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… | 1. Understand the function of astable multivibrators in synthesizers. 2. Investigate the factors that can cause detuning in electronic circuits. 3. An… | 1. Astable multivibrators are circuits that generate a continuous square wave output, commonly used in synthesizers to create audio signals. They rely… | A music synthesizer built with a chain of astable multivibrator circuits detunes after a few hours due to temperature variations affecting resistor an… | 8 | The student’s answer provides a comprehensive explanation of the factors contributing to detuning in a music synthesizer, including power supply fluct… |
| 7311 | In an equal-tempered musical scale, how many intervals are there in an octave? | An octave in an equal-tempered musical scale is divided into 12 intervals. | 1. Understand the concept of an octave in music theory. 2. Define what an interval is in the context of a musical scale. 3. Identify the number of sem… | An octave in music theory is the interval between one musical pitch and another with double its frequency. In an equal-tempered scale, an octave is di… | There are 12 intervals in an octave in an equal-tempered musical scale. | 10 | The student’s answer is correct and accurately states that an octave in an equal-tempered musical scale is divided into 12 intervals. The phrasing is … |
| 7769 | In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? | To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… | 1. Identify the total number of participants (n = 142). 2. Use the combination formula to calculate the number of ways to choose 4 people from 142. 3…. | 1. We have 142 participants, so n = 142. 2. We want to form groups of 4, so r = 4. 3. The combination formula is C(142, 4) = 142! / (4! * (142 - 4)!)…. | 16242880 | 7 | The student correctly explained the combination formula and applied it to the problem. However, the final calculation of the number of groups is incor… |
| 8248 | In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… | To prove this, we start with the given equation: $ \left(\frac{3}{2}\right)^m = \left(\frac{1}{2}\right)^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… | 1. Understand the concept of just intonation and pure fifths. 2. Define the mathematical equation to be proven. 3. Analyze the left-hand side and righ… | 1. Just intonation is a system of tuning based on ratios of whole numbers, where a pure fifth is represented by the ratio 3:2. Stacking pure fifths me… | There are no positive integer solutions for the equation (3/2)^m = (1/2)^n, proving that stacking just intonation pure fifths will never result in a p… | 7 | The student’s answer correctly identifies that there are no positive integer solutions to the equation, but it lacks clarity in the mathematical reaso… |
| 14377 | Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… | There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … | 1. Research the history and context of 432Hz and 440Hz tuning. 2. Investigate scientific studies on the effects of different tuning frequencies on hum… | 1. The history of musical tuning reveals that 440Hz became the standard tuning frequency in the mid-20th century, while 432Hz has been associated with… | There is no significant scientific evidence to support that tuning musical instruments to 432Hz provides benefits to human well-being compared to the … | 8 | The student’s answer accurately addresses the question by stating that there is insufficient scientific evidence to support the benefits of 432Hz tuni… |
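One way to look for differences is to put the two sets of scores side by side. The sketch below copies the scores from the two graded tables in this activity into a small data frame and summarizes the change per question. Your own run will produce different numbers.

```r
library(dplyr)

# Scores copied from the simple and Chain of Thought grading tables above.
scores <- data.frame(
  id     = c(2568, 7311, 7769, 8248, 14377),
  simple = c(7, 8, 7, 8, 8),
  cot    = c(8, 10, 7, 7, 8)
)

score_summary <- scores %>%
  mutate(change = cot - simple) %>%
  summarise(
    mean_simple = mean(simple),
    mean_cot    = mean(cot),
    n_improved  = sum(change > 0),
    n_worse     = sum(change < 0)
  )

score_summary
```

With only five questions and a single grading run, small differences like these should be read as a prompt for closer inspection of the feedback, not as evidence that one prompting style is better.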

7. Use RAG to ground answers

Retrieval-Augmented Generation (RAG) is a technique where you first retrieve documents relevant to a question (here, Wikipedia articles) and then include their text in the prompt. This lets the LLM ground its answer in that source material instead of relying only on what it learned during training.

Add article text to the data frame

To perform RAG, we will ask the LLM to first generate search queries based on the questions.

# Step 1: Get search keywords
add_keywords <- function(question, ...) {
  # Ask the LLM for three Wikipedia search queries related to the question
   prompt <- paste(
      "
      Your task is to identify three diverse, detailed search queries to gather information from Wikipedia about the following question.
      The queries are going to be run through Wikipedia's internal search engine to retrieve entire relevant articles.
      So make sure they are in the \"goldilocks\" zone of specificity.
      They should be specific enough to identify the best related article, but not more specific than that.
      Provide your response as a JSON object with the following attributes:
      - query1: A string containing the first search query
      - query2: A string containing the second search query
      - query3: A string containing the third search query
      ",
      question
   );
   query_response <- json_completion(prompt, max_tokens = 500);
   return (list(
     query1 = query_response$query1,
     query2 = query_response$query2,
     query3 = query_response$query3
   ));
}


# Add keywords to the dataset
rag_df <- 
  interesting_df %>% 
  map_table_rows(add_keywords);
rag_df %>% format_table()
id question answer query1 query2 query3
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… astable multivibrator circuits in music synthesizers synthesizer tuning stability and temperature effects effects of resistor values on synthesizer circuit performance
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals. equal-tempered musical scale intervals in an octave octave structure in music theory number of semitones in an octave
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… combinatorial mathematics groups of four combinations formula example binomial coefficient calculation
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ (\frac{3}{2})^m = (\frac{1}{2})^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… just intonation and pure fifths music theory mathematical proof stacking fifths octave tuning properties of just intonation and octave equivalence
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … 432Hz tuning benefits human well-being comparison of 432Hz and 440Hz musical tuning effects of musical tuning frequencies on health
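The model is asked for a JSON object with three fields, but nothing guarantees that every field comes back populated. The sketch below shows one way to guard against that. `extract_queries` is a hypothetical helper, not part of the activity's code, and it assumes the response has already been parsed into an R list.

```r
# Pull query1..query3 out of a parsed JSON response, returning NA for any
# field that is missing, empty, or not a single string.
# NOTE: `extract_queries` is an illustrative helper, not from the tutorial.
extract_queries <- function(response, fields = c("query1", "query2", "query3")) {
  values <- lapply(fields, function(field) {
    value <- response[[field]]
    is_valid <- is.character(value) && length(value) == 1 &&
      !is.na(value) && nzchar(trimws(value))
    if (is_valid) trimws(value) else NA_character_
  })
  setNames(values, fields)
}

# Example: the second field is missing and the third is an empty string.
partial <- list(query1 = " just intonation ", query3 = "")
str(extract_queries(partial))
```

Returning `NA` rather than stopping keeps the row in the data frame, so a later step can decide whether to skip or retry the search.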

Next, we use these search queries to fetch the article text from Wikipedia. After doing so, we ask the LLM to summarize the salient facts from the articles that are relevant to the question.

# Step 2: Get article text for each query
add_articles <- function(question, query1, query2, query3, ...) {
  # Get the article text for each query
  article1 <- search_wikipedia(query1)
  article2 <- search_wikipedia(query2)
  article3 <- search_wikipedia(query3)
  
  # Use the LLM to extract key facts the LLM should focus on.
  facts <- completion(paste(
    "You are an expert synthesizer. Your job is to extract the most salient facts from three different articles related to a question.",
    "Write three to five sentences capturing the most important facts needed to answer the question.",
    toJSON(list(question = question, articles = list(article1$text, article2$text, article3$text)))
  ))
  
  # Return the results
  return (list(
     article_title1 = article1$title,
     article_title2 = article2$title,
     article_title3 = article3$title,
     article_text1 = article1$text,
     article_text2 = article2$text,
     article_text3 = article3$text,
     facts=facts
   ));
}


# Add article text and extracted facts to the dataset
rag_facts_df <- 
  rag_df %>% 
  map_table_rows(add_articles);
rag_facts_df %>% 
  format_table()
id question answer query1 query2 query3 article_title1 article_title2 article_title3 article_text1 article_text2 article_text3 facts
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… astable multivibrator circuits in music synthesizers synthesizer tuning stability and temperature effects effects of resistor values on synthesizer circuit performance Feedback Crystal oscillator Thermistor Feedback occurs when outputs of a system are routed back as inputs as part of a chain of cause-and-effect that forms a circuit or loop. The system can… A crystal oscillator is an electronic oscillator circuit that uses a piezoelectric crystal as a frequency-selective element. The oscillator frequency … A thermistor is a semiconductor type of resistor whose resistance is strongly dependent on temperature, more so than in standard resistors. The word t… A music synthesizer built with a chain of astable multivibrator circuits can detune after a few hours due to several factors. Even with fixed resistor…
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals. equal-tempered musical scale intervals in an octave octave structure in music theory number of semitones in an octave Equal temperament Music and mathematics Scale (music) An equal temperament is a musical temperament or tuning system that approximates just intervals by dividing an octave (or other interval) into steps s… Music theory analyzes the pitch, timing, and structure of music. It uses mathematics to study elements of music such as tempo, chord progression, form… In music theory, a scale is “any consecutive series of notes that form a progression between one note and its octave”, typically by order of pitch or … In an equal-tempered musical scale, specifically the most common system known as 12-tone equal temperament (12 TET), an octave is divided into 12 equa…
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… combinatorial mathematics groups of four combinations formula example binomial coefficient calculation Combinatorics Combination Binomial coefficient Combinatorics is an area of mathematics primarily concerned with counting, both as a means and as an end to obtaining results, and certain properties … In mathematics, a combination is a selection of items from a set that has distinct members, such that the order of selection does not matter (unlike p… In mathematics, the binomial coefficients are the positive integers that occur as coefficients in the binomial theorem. Commonly, a binomial coefficie… To determine how many different groups of 4 can be formed from 142 people, we use the concept of combinations in combinatorics. The number of ways to …
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ ()^m = ()^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… just intonation and pure fifths music theory mathematical proof stacking fifths octave tuning properties of just intonation and octave equivalence Just intonation List of guitar tunings Semitone In music, just intonation or pure intonation is the tuning of musical intervals as whole number ratios (such as 3:2 or 4:3) of frequencies. An interva… This article contains a list of guitar tunings that supplements the article guitar tunings. In particular, this list contains more examples of open an… A semitone, also called a minor second, half step, or a half tone, is the smallest musical interval commonly used in Western tonal music, and it is co… In music theory, stacking a series of just intonation pure fifths (ratios of 3:2) will never yield a perfectly tuned octave (ratio of 2:1) or its mult…
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … 432Hz tuning benefits human well-being comparison of 432Hz and 440Hz musical tuning effects of musical tuning frequencies on health NA NA Psychoacoustics NA NA Psychoacoustics is the branch of psychophysics involving the scientific study of the perception of sound by the human auditory system. It is the branc… Tuning musical instruments to 432Hz is often claimed to provide benefits to human well-being, with proponents suggesting it is more “natural” than the…
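In the last row above, two of the three lookups came back as NA. One option is to drop missing articles and cap the length of each remaining one before building the JSON payload for the fact-extraction prompt. The following is a sketch under assumptions: `build_facts_payload` and its `max_chars` limit are hypothetical and do not appear in the activity's code.

```r
library(jsonlite)

# Build the JSON payload for the fact-extraction prompt, skipping failed
# lookups (NA or empty text) and truncating each article to `max_chars`.
# NOTE: `build_facts_payload` and `max_chars` are illustrative assumptions.
build_facts_payload <- function(question, article_texts, max_chars = 4000) {
  texts <- as.character(unlist(article_texts, use.names = FALSE))
  texts <- texts[!is.na(texts) & nzchar(texts)]
  texts <- substr(texts, 1, max_chars)
  toJSON(list(question = question, articles = as.list(texts)), auto_unbox = TRUE)
}

# Example mirroring the 432Hz row: only the third lookup returned text.
payload <- build_facts_payload(
  "Does 432Hz tuning benefit well-being?",
  list(NA, NA, "Psychoacoustics is the study of sound perception."),
  max_chars = 15
)
payload
```

Truncating by characters is crude, because it can cut a sentence in half, but it keeps the prompt size predictable when an article is very long.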

Finally, we use these facts to answer the questions.

add_rag_answer <- function(question, facts, ...) {
   prompt <- paste(
      "
      Your task is to provide a detailed response to the following question.
      You will be provided with facts from Wikipedia articles that may (or may not!) be relevant to the question.
      Read the question and then the facts, and then generate a JSON object with the following fields:
      - plan: A single string containing a step-by-step plan for solving the problem.
      - details: A single string containing a detailed step-by-step solution to the problem.
      - answer: A single string containing the final answer to the question.
      
      Perform this task for the following question:
      ",
      toJSON(list(question = question, facts = facts))
   );
   rag_response <- json_completion(prompt, max_tokens = 2000);
   return (list(
     plan = rag_response$plan,
     details = rag_response$details,
     predicted = rag_response$answer
   ));
};

rag_answers <- 
  rag_facts_df %>%
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>% # Convert empty strings to NA
  map_table_rows(add_rag_answer);

rag_answers %>% format_table()
id question answer query1 query2 query3 article_title1 article_title2 article_title3 article_text1 article_text2 article_text3 facts plan details predicted
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… astable multivibrator circuits in music synthesizers synthesizer tuning stability and temperature effects effects of resistor values on synthesizer circuit performance Feedback Crystal oscillator Thermistor Feedback occurs when outputs of a system are routed back as inputs as part of a chain of cause-and-effect that forms a circuit or loop. The system can… A crystal oscillator is an electronic oscillator circuit that uses a piezoelectric crystal as a frequency-selective element. The oscillator frequency … A thermistor is a semiconductor type of resistor whose resistance is strongly dependent on temperature, more so than in standard resistors. The word t… A music synthesizer built with a chain of astable multivibrator circuits can detune after a few hours due to several factors. Even with fixed resistor…
  1. Identify the factors that can cause detuning in a music synthesizer. 2. Explain how temperature changes affect the components in the circuit. 3. Di…
To understand why a music synthesizer built with a chain of astable multivibrator circuits detunes after a few hours, we need to consider several fact… A music synthesizer built with a chain of astable multivibrator circuits detunes after a few hours due to temperature changes affecting resistance and…
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals. equal-tempered musical scale intervals in an octave octave structure in music theory number of semitones in an octave Equal temperament Music and mathematics Scale (music) An equal temperament is a musical temperament or tuning system that approximates just intervals by dividing an octave (or other interval) into steps s… Music theory analyzes the pitch, timing, and structure of music. It uses mathematics to study elements of music such as tempo, chord progression, form… In music theory, a scale is “any consecutive series of notes that form a progression between one note and its octave”, typically by order of pitch or … In an equal-tempered musical scale, specifically the most common system known as 12-tone equal temperament (12 TET), an octave is divided into 12 equa…
  1. Identify the type of musical scale mentioned in the question. 2. Determine how many intervals are in an octave according to the facts provided. 3. …
The question asks about the number of intervals in an octave within an equal-tempered musical scale. The relevant fact states that in the 12-tone equa… There are 12 intervals in an octave in an equal-tempered musical scale.
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… combinatorial mathematics groups of four combinations formula example binomial coefficient calculation Combinatorics Combination Binomial coefficient Combinatorics is an area of mathematics primarily concerned with counting, both as a means and as an end to obtaining results, and certain properties … In mathematics, a combination is a selection of items from a set that has distinct members, such that the order of selection does not matter (unlike p… In mathematics, the binomial coefficients are the positive integers that occur as coefficients in the binomial theorem. Commonly, a binomial coefficie… To determine how many different groups of 4 can be formed from 142 people, we use the concept of combinations in combinatorics. The number of ways to …
  1. Identify the total number of participants (n = 142) and the group size (k = 4). 2. Use the formula for combinations to calculate the number of ways…
To find the number of different groups of 4 that can be formed from 142 people, we will use the combinations formula, which is given by C(n, k) = n! /… 16234517
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ ()^m = ()^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… just intonation and pure fifths music theory mathematical proof stacking fifths octave tuning properties of just intonation and octave equivalence Just intonation List of guitar tunings Semitone In music, just intonation or pure intonation is the tuning of musical intervals as whole number ratios (such as 3:2 or 4:3) of frequencies. An interva… This article contains a list of guitar tunings that supplements the article guitar tunings. In particular, this list contains more examples of open an… A semitone, also called a minor second, half step, or a half tone, is the smallest musical interval commonly used in Western tonal music, and it is co… In music theory, stacking a series of just intonation pure fifths (ratios of 3:2) will never yield a perfectly tuned octave (ratio of 2:1) or its mult…
  1. Understand the ratios involved in just intonation and octaves. 2. Rewrite the equation \((\frac{3}{2})^m = (\frac{1}{2})^n\) in a more manageable…
  1. In music theory, just intonation uses the ratio of 3:2 for pure fifths and 2:1 for octaves. 2. The equation \((\frac{3}{2})^m = (\frac{1}{2})^n\)
Stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples, as demonstrated by the lack of positi…
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … 432Hz tuning benefits human well-being comparison of 432Hz and 440Hz musical tuning effects of musical tuning frequencies on health NA NA Psychoacoustics NA NA Psychoacoustics is the branch of psychophysics involving the scientific study of the perception of sound by the human auditory system. It is the branc… Tuning musical instruments to 432Hz is often claimed to provide benefits to human well-being, with proponents suggesting it is more “natural” than the…
  1. Analyze the claims about 432Hz tuning and its benefits. 2. Review the scientific evidence regarding the effects of tuning frequencies on human well…
First, we need to examine the claims made by proponents of 432Hz tuning, who argue that it is more natural and beneficial for human well-being compare… There is no significant scientific evidence to support the claim that tuning musical instruments to 432Hz provides benefits to human well-being compar…
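Arithmetic answers like the one for question 7769 can be checked directly in R rather than trusted from the model. The sketch below computes the number of 4-person groups from 142 people in two ways, using only base R, and compares the result with the predicted value of 16234517 shown in the table.

```r
n_people <- 142
group_size <- 4

# Binomial coefficient C(142, 4) via base R.
groups <- choose(n_people, group_size)

# Same value from the product form 142 * 141 * 140 * 139 / 4!.
groups_manual <- prod(n_people - 0:(group_size - 1)) / factorial(group_size)

format(groups, big.mark = ",")  # "16,234,505"
groups - 16234517               # -12: the model's figure is off by 12
```

The model's plan and formula were sound, but its final figure was slightly off, which is a common failure mode when an LLM does multi-digit arithmetic in text.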

We then evaluate the results using LLM-as-judge.

rag_answers %>% 
  map_table_rows(grade_predicted_answer) %>%
  format_table()
id question answer query1 query2 query3 article_title1 article_title2 article_title3 article_text1 article_text2 article_text3 facts plan details predicted score feedback
2568 Why does a music synthesizer built with a chain of astable multivibrator circuits detune after a few hours, even with fixed resistor values instead of… The detuning in your analog synthesizer is due to various factors that affect the oscillation frequency over time:1. Power supply voltage fluctuations… astable multivibrator circuits in music synthesizers synthesizer tuning stability and temperature effects effects of resistor values on synthesizer circuit performance Feedback Crystal oscillator Thermistor Feedback occurs when outputs of a system are routed back as inputs as part of a chain of cause-and-effect that forms a circuit or loop. The system can… A crystal oscillator is an electronic oscillator circuit that uses a piezoelectric crystal as a frequency-selective element. The oscillator frequency … A thermistor is a semiconductor type of resistor whose resistance is strongly dependent on temperature, more so than in standard resistors. The word t… A music synthesizer built with a chain of astable multivibrator circuits can detune after a few hours due to several factors. Even with fixed resistor…
  1. Identify the factors that can cause detuning in a music synthesizer. 2. Explain how temperature changes affect the components in the circuit. 3. Di…
To understand why a music synthesizer built with a chain of astable multivibrator circuits detunes after a few hours, we need to consider several fact… A music synthesizer built with a chain of astable multivibrator circuits detunes after a few hours due to temperature changes affecting resistance and… 8 The student’s answer provides a comprehensive explanation of the factors contributing to detuning in a music synthesizer, including power supply fluct…
7311 In an equal-tempered musical scale, how many intervals are there in an octave? An octave in an equal-tempered musical scale is divided into 12 intervals. equal-tempered musical scale intervals in an octave octave structure in music theory number of semitones in an octave Equal temperament Music and mathematics Scale (music) An equal temperament is a musical temperament or tuning system that approximates just intervals by dividing an octave (or other interval) into steps s… Music theory analyzes the pitch, timing, and structure of music. It uses mathematics to study elements of music such as tempo, chord progression, form… In music theory, a scale is “any consecutive series of notes that form a progression between one note and its octave”, typically by order of pitch or … In an equal-tempered musical scale, specifically the most common system known as 12-tone equal temperament (12 TET), an octave is divided into 12 equa…
  1. Identify the type of musical scale mentioned in the question. 2. Determine how many intervals are in an octave according to the facts provided. 3. …
The question asks about the number of intervals in an octave within an equal-tempered musical scale. The relevant fact states that in the 12-tone equa… There are 12 intervals in an octave in an equal-tempered musical scale. 10 The student’s answer is correct and accurately states that an octave in an equal-tempered musical scale is divided into 12 intervals. The phrasing is …
7769 In a musical event called extreme quarteting, 142 people participated. How many different groups of 4 can be formed from these 142 people? To calculate the number of different groups of 4 that can be formed from 142 people, we use the combination formula, which gives the number of ways to… combinatorial mathematics groups of four combinations formula example binomial coefficient calculation Combinatorics Combination Binomial coefficient Combinatorics is an area of mathematics primarily concerned with counting, both as a means and as an end to obtaining results, and certain properties … In mathematics, a combination is a selection of items from a set that has distinct members, such that the order of selection does not matter (unlike p… In mathematics, the binomial coefficients are the positive integers that occur as coefficients in the binomial theorem. Commonly, a binomial coefficie… To determine how many different groups of 4 can be formed from 142 people, we use the concept of combinations in combinatorics. The number of ways to …
  1. Identify the total number of participants (n = 142) and the group size (k = 4). 2. Use the formula for combinations to calculate the number of ways…
To find the number of different groups of 4 that can be formed from 142 people, we will use the combinations formula, which is given by C(n, k) = n! /… 16234517 7 The student correctly explained the combination formula and applied it to the problem. However, the final calculation of the number of groups is incor…
8248 In music theory, prove that stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples. Mathemati… To prove this, we start with the given equation: $ ()^m = ()^n $. We can rewrite this as $ 3^m = 2^{m-n} $. Since 2 and 3 are di… just intonation and pure fifths music theory mathematical proof stacking fifths octave tuning properties of just intonation and octave equivalence Just intonation List of guitar tunings Semitone In music, just intonation or pure intonation is the tuning of musical intervals as whole number ratios (such as 3:2 or 4:3) of frequencies. An interva… This article contains a list of guitar tunings that supplements the article guitar tunings. In particular, this list contains more examples of open an… A semitone, also called a minor second, half step, or a half tone, is the smallest musical interval commonly used in Western tonal music, and it is co… In music theory, stacking a series of just intonation pure fifths (ratios of 3:2) will never yield a perfectly tuned octave (ratio of 2:1) or its mult…
  1. Understand the ratios involved in just intonation and octaves. 2. Rewrite the equation \((\frac{3}{2})^m = (\frac{1}{2})^n\) in a more manageable…
  1. In music theory, just intonation uses the ratio of 3:2 for pure fifths and 2:1 for octaves. 2. The equation \((\frac{3}{2})^m = (\frac{1}{2})^n\)
Stacking a series of just intonation pure fifths will never result in a perfectly tuned octave or its multiples, as demonstrated by the lack of positi… 7 The student’s answer correctly identifies the mathematical reasoning behind the lack of solutions to the equation, but it could be clearer in its expl…
14377 Does tuning musical instruments to 432Hz provide any significant benefits to human well-being or is it more “natural” compared to the standard 440Hz t… There is currently insufficient scientific evidence to support the claim that 432Hz tuning is objectively superior or more beneficial to human beings … 432Hz tuning benefits human well-being comparison of 432Hz and 440Hz musical tuning effects of musical tuning frequencies on health NA NA Psychoacoustics NA NA Psychoacoustics is the branch of psychophysics involving the scientific study of the perception of sound by the human auditory system. It is the branc… Tuning musical instruments to 432Hz is often claimed to provide benefits to human well-being, with proponents suggesting it is more “natural” than the…
  1. Analyze the claims about 432Hz tuning and its benefits. 2. Review the scientific evidence regarding the effects of tuning frequencies on human well…
First, we need to examine the claims made by proponents of 432Hz tuning, who argue that it is more natural and beneficial for human well-being compare… There is no significant scientific evidence to support the claim that tuning musical instruments to 432Hz provides benefits to human well-being compar… 9 The student’s answer accurately addresses the question by highlighting the lack of scientific evidence supporting the benefits of 432Hz tuning over 44…
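With only five rows the scores can be read off the table, but on a larger sample a short summary helps. The sketch below hard-codes the ids and scores from the grading table above. The review threshold of 8 is an arbitrary assumption for illustration.

```r
# Scores copied from the grading table above (ids and scores only).
rag_scores <- data.frame(
  id = c(2568, 7311, 7769, 8248, 14377),
  score = c(8, 10, 7, 7, 9)
)

# Overall summary of the judge's scores.
mean(rag_scores$score)   # 8.2
range(rag_scores$score)  # 7 10

# Rows that fall below the (assumed) review threshold of 8.
needs_review <- subset(rag_scores, score < 8)
needs_review
```

Both flagged rows (7769 and 8248) are mathematical questions, where retrieved background text does little to prevent a calculation or proof error.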

Conclusion

Through this activity, you’ve learned how to:

Understanding how to use LLMs for grading can help in assessing model performance and automating evaluation tasks.

Further Exploration

References