Using “AI” for data analysis

  • ChatGPT et al. do not have super-human intelligence.
  • They are large language models (LLMs), i.e. statistical models of human language trained on datasets of trillions of words of text.
  • They learn statistical patterns in language, which can allow them to extrapolate text on new topics in response to our queries.
  • Information can be incorrect. For example, ChatGPT may provide code involving some functions or commands in R that simply do not exist.
  • Just copying and pasting code without understanding it is likely to go wrong eventually.
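
One quick sanity check before running LLM-generated code is to ask R whether the suggested function names exist at all. The following is a minimal sketch using base R only; make_pumpkin_plot is a made-up name standing in for a function an LLM might invent.

```r
# Check whether function names suggested by an LLM can be found
# in the packages currently loaded in this R session.
suggested <- c("mean", "rnorm", "make_pumpkin_plot")  # last one is made up

found <- vapply(
  suggested,
  function(fn) exists(fn, mode = "function"),
  logical(1)
)
print(found)

# Names that R cannot find: check the spelling and the package
# documentation before trusting the code.
missing_fns <- names(found)[!found]
print(missing_fns)
```

Note that a real function from a package you have not loaded with library() will also be reported as missing, so a "not found" result is a prompt to investigate, not proof that the function is invented.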

Using “AI” for data analysis

NTU policy: “Generative Artificial Intelligence (GenAI) should be used to support and enhance your learning, not to replace independent thinking, critical analysis, or academic integrity.”

  • We encourage you to use LLMs such as ChatGPT to enhance your learning.
  • They are excellent tools for learning how to do (new) things in R and about data analysis generally.

Overview

  1. The quality of the response you receive depends, among other things, on your prompt. How do you write AI prompts for data analysis?
  2. Other than Copilot and ChatGPT, there are AI tools specifically for data analysis and RStudio. We will have a go at Julius AI.

You’ll need access to

Writing a Good Prompt

A well-crafted prompt helps a lot:

  • What you want (task clarity)
  • How you want it done (method or constraints)
  • What context matters (data, packages, goals)

Role, Goal, Constraint, Data (RGCD) approach

  • Role: Define the role of the LLM. Who / what do you want the LLM to be?
  • Goal: What goal / task do you want help with?
  • Constraint: Outline any constraints or guidelines that you would like the LLM to follow.
  • Data: Describe any specific data or background context to consider.
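
If you reuse the same structure often, the four RGCD parts can be kept separate and pasted together. This is only an illustrative sketch: build_prompt() is a hypothetical helper, not part of any package, and the example text is invented.

```r
# Hypothetical helper: join the four RGCD parts into one prompt string.
build_prompt <- function(role, goal, constraint, data) {
  paste(role, goal, constraint, data)
}

prompt <- build_prompt(
  role       = "I need you to write R code for me. I have limited experience in data visualisation.",
  goal       = "I want to create a scatterplot of y against x.",
  constraint = "I need to be able to run the code in RStudio, using the ggplot2 package.",
  data       = "My data frame is called df and has two numeric columns, x and y."
)

# Print the prompt so it can be copied into Copilot / ChatGPT
cat(prompt)
```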

Task 1: Create a scatterplot with pumpkins.

Try to use this prompt in Copilot / ChatGPT:

“I need you to write R code for me. I have limited experience in data visualisation (role). I want to create a Halloween-themed scatterplot with pumpkins instead of dots (goal). I need to be able to generate the scatterplot using code in RStudio (constraint). Please use the R package ggplot2 to create the code for this scatterplot (data).”

I received two responses; neither of them worked (but I fixed them):

  • Response 1: had to install ggtext
  • Response 2: couldn’t find the URL to the pumpkin picture
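
Both kinds of problem can be spotted before running the full script. A minimal sketch, assuming the package names used in the two responses and a local image path of gfx/pumpkin.png (adjust both to your own setup):

```r
# Packages that the LLM-generated code wants to load
needed <- c("ggplot2", "ggtext", "ggimage")

# requireNamespace() returns FALSE (instead of stopping) if a package is missing
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!installed]

if (length(to_install) > 0) {
  message("Install first: install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "), "))")
}

# Check that an image file referenced in the code actually exists
img <- "gfx/pumpkin.png"
if (!file.exists(img)) {
  message("Image not found: ", img, " (check the path or download the file)")
}
```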

Task 1: Response 1

# Load required packages
library(ggplot2)
library(ggtext)  # <- had to install this

# Create sample data
set.seed(123)
df <- data.frame(
  x = rnorm(20),
  y = rnorm(20),
  label = rep("🎃", 20)
)

# Plot using text labels as points
ggplot(df, aes(x = x, y = y)) +
  geom_text(aes(label = label), size = 8) +
  theme_minimal() +
  labs(title = "Halloween Scatterplot", 
       subtitle = "Pumpkins instead of dots!") +
  theme(
    plot.title = element_text(face = "bold", 
                              size = 16),
    plot.subtitle = element_text(size = 12)
  )

Task 1: Response 2

# Install if needed
# install.packages("ggimage")

library(ggplot2)
library(ggimage)

# Sample data
df <- data.frame(
  x = rnorm(20),
  y = rnorm(20),
  image = "gfx/pumpkin.png" # <- had to fix this!
)

# Plot with image
ggplot(df, aes(x = x, y = y)) +
  geom_image(aes(image = image), size = 0.2) +
  theme_minimal() +
  labs(title = "Halloween Scatterplot", 
       subtitle = "Pumpkin images as points")

Prompt Writing Strategies: Be Specific and Use Natural Language

Instead of:

“Help me with my R code”

Try:

“I have a dataset with columns age, smoker, and score. I want to run a linear regression predicting score from age and smoker using lm() in R. Can you write the code?”
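
A specific prompt like this tends to produce code along the following lines. This is a sketch, not an actual LLM response; the data are simulated here so that the example runs on its own.

```r
# Simulated stand-in data with the three columns named in the prompt
set.seed(1)
df <- data.frame(
  age    = round(runif(50, 18, 70)),
  smoker = factor(sample(c("no", "yes"), 50, replace = TRUE))
)
df$score <- 50 + 0.3 * df$age - 5 * (df$smoker == "yes") + rnorm(50, sd = 4)

# Linear regression predicting score from age and smoker
model <- lm(score ~ age + smoker, data = df)
summary(model)
```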

Prompt Writing Strategies: Include Context

  • What packages you’re using (tidyverse, ggplot2, psyntur etc.)
  • What the data looks like (a few rows or column names)
  • What error you’re getting (copy-paste the message)

“I’m using ggplot2 to plot a histogram of score, but I get the error: object 'score' not found. Here’s my code…”
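
The context for such a prompt can be collected with a few base R commands. A minimal sketch, using a small made-up data frame called df in place of your own data:

```r
# Made-up stand-in data; replace df with your own data frame
df <- data.frame(
  age    = c(23, 35, 41),
  smoker = c("no", "yes", "no"),
  score  = c(12, 9, 15)
)

names(df)          # column names
head(df, 3)        # first few rows
dput(head(df, 3))  # copy-pasteable version of the first few rows

# Capture the structure of the data as text to paste into a prompt
context <- paste(capture.output(str(df)), collapse = "\n")
cat(context)
```

Only paste rows into a prompt if the data are not confidential; column names and types are often enough.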

Task 2: Prompt Writing Strategies

Use Copilot to create a simple R script that loads and visualises data:

Load Halloween candy-ratings dataset:

# Load package
library(tidyverse)
# Load data
candy_data <- read_csv("https://query.data.world/s/zvkxyzlcxtegs3j6dsz3gjw2zloxil?dws=00000")

Variables:

  • winpercent (continuous): % of times a candy was preferred
  • chocolate, fruity, caramel, peanutyalmondy, etc. (binary categorical)
  • pricepercent (continuous): relative cost
  • sugarpercent (continuous): relative sugar content
  • competitorname (text): name of the candy

Task:

Using the ggplot2 package, create a scatterplot that shows the relationship between sugar percentage and price percentage. Instead of dots, show the name of each candy (competitorname) in the plot.

When you write your prompt, tell ChatGPT / Copilot your variable names.

Prompt Writing Strategies: Fixing errors

This code

# Load data
candy_data <- read_csv("https://query.data.world/s/zvkxyzlcxtegs3j6dsz3gjw2zloxil?dws=00000")

# Create scatterplot
ggplot(candy_dta, aes(x = sugarpercent, y = pricepercent, label = competitorname)) +
  geom_text()

returns the error Error in ggplot: object 'candy_dta' not found

Craft a prompt for Copilot / ChatGPT to fix this error!

For example

“I tried to plot candy preferences using ggplot2 but I got an error saying object 'candy_dta' not found. Here’s my code. Can you help me fix it?”
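
An "object not found" error can often be diagnosed without an LLM by checking which objects exist in the workspace. A minimal sketch; if candy_data has not been loaded, it creates placeholder values (random numbers, not the real ratings) so that the code runs on its own.

```r
# Use the real candy_data if it is loaded; otherwise create placeholder
# values (random numbers, not the real data) so the sketch runs.
if (!exists("candy_data")) {
  candy_data <- data.frame(
    sugarpercent = runif(50),
    pricepercent = runif(50),
    winpercent   = runif(50, 20, 85)
  )
}

exists("candy_data")  # the correctly spelled object
exists("candy_dta")   # the misspelled name used in the ggplot() call

# List all objects in the workspace to compare names
ls()
```

Comparing the name in the error message with the output of ls() usually reveals a typo like this one.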

Prompt Writing Strategies: Ask for Explanations

“Can you explain what this line of code does?”

“Why do we use geom_text() here?”

Task 3: Prompt Writing Strategies: Break Down Tasks

Instead of asking for everything at once, guide the LLM step-by-step:

  1. Create a linear regression model that predicts price from sugar percentage.
  2. Get help to interpret the linear regression output.
  3. Ask about potential violations of the model assumptions for this analysis.
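
The three steps might lead to code like the following. This is a sketch under assumptions, not the expected answer; if candy_data has not been loaded, placeholder values (random numbers, not the real data) are simulated so that the code runs on its own.

```r
# Use the real candy_data if it is loaded; otherwise simulate placeholder
# values (not the real data) so the sketch runs.
if (!exists("candy_data")) {
  set.seed(42)
  candy_data <- data.frame(
    sugarpercent = runif(50),
    pricepercent = runif(50),
    winpercent   = runif(50, 20, 85)
  )
}

# Step 1: fit a linear regression predicting price from sugar percentage
model <- lm(pricepercent ~ sugarpercent, data = candy_data)

# Step 2: print the output; paste it into a prompt to get help interpreting it
summary(model)

# Step 3: standard diagnostic plots for checking model assumptions
# (linearity, normality of residuals, constant variance, influential cases)
par(mfrow = c(2, 2))
plot(model)
```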

Write a prompt to obtain code that saves the candy_data as a file (e.g. in CSV format) on your computer (or in your RStudio Cloud project). For the next task we need the candy data set saved on your machine so we can upload it elsewhere.
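
One possible result of such a prompt, shown as a base R sketch. The file name candy_data.csv is an assumption, and placeholder values (random numbers, not the real data) are used only if candy_data has not been loaded.

```r
# Use the real candy_data if it is loaded; otherwise create placeholder
# values (not the real data) so the sketch runs.
if (!exists("candy_data")) {
  candy_data <- data.frame(
    sugarpercent = runif(50),
    pricepercent = runif(50),
    winpercent   = runif(50, 20, 85)
  )
}

# Save the data as a CSV file in the current working directory
write.csv(candy_data, "candy_data.csv", row.names = FALSE)

# Check where the file was saved and that it exists
getwd()
file.exists("candy_data.csv")
```

If the tidyverse is loaded, write_csv() from the readr package is a common alternative to write.csv().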

AI Tools for RStudio & Data Analysis

Gander

  • RStudio add-in that acts like a lightweight Copilot.
  • Context-aware suggestions based on script and environment.
  • Can use models like Claude Sonnet.
  • Great for debugging and code generation.
  • Not ideal for use with RStudio Cloud.

Julius AI

  • Web-based assistant for statistical analysis.
  • Supports data cleaning, visualisation, and statistical tests.
  • Complements RStudio for quick insights without installing anything.


Workik AI

  • Online R code generator using GPT models.
  • Supports libraries like tidyverse, ggplot2, caret, and shiny.
  • Useful for generating code for data wrangling, modeling, and reporting.
  • Offers collaborative features.

YesChat AI

  • AI assistant embedded in RStudio.
  • Helps with syntax, debugging, and optimisation.
  • Powered by ChatGPT-4o, tailored for beginners and researchers.

Zencoder

  • Full-featured AI coding agent with support for testing and code review.

Task 4: Getting Started with Julius AI

  • Go to Julius AI; sign up using your email, Google, or Apple account.
  • Select R as code language (top left area).
  • Upload the candy dataset (the dataset should have clear column headers).

Write prompts to describe what you want.

  • Data cleaning: “Create a new column that distinguishes healthy from unhealthy candies on the basis of the sugarpercent information.”
  • Descriptive stats: “Show summary statistics for winpercent separately for the healthy and unhealthy candies.”
  • Inferential tests: “Run a t-test comparing candy ratings of the healthy and unhealthy groups.”
  • Visualisation: “Create a scatterplot with candy preferences using the winpercent and sugarpercent data separated by healthiness using ggplot2 code.”
  • Modeling: “Fit a linear regression predicting candy rating from healthiness and pricepercent.”
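
To review what Julius AI returns, it helps to know roughly what these prompts correspond to in R. A base R sketch under stated assumptions: the 0.5 cut-off for "healthy" and the column name health are arbitrary choices made for illustration, and placeholder values (random numbers, not the real data) are used only if candy_data has not been loaded.

```r
# Use the real candy_data if it is loaded; otherwise simulate placeholder
# values (not the real data) so the sketch runs.
if (!exists("candy_data")) {
  set.seed(7)
  candy_data <- data.frame(
    sugarpercent = runif(50),
    pricepercent = runif(50),
    winpercent   = runif(50, 20, 85)
  )
}

# Data cleaning: new column based on sugarpercent (0.5 cut-off is an assumption)
candy_data$health <- ifelse(candy_data$sugarpercent < 0.5, "healthy", "unhealthy")

# Descriptive stats: summary of winpercent for each group
aggregate(winpercent ~ health, data = candy_data, FUN = summary)

# Inferential test: Welch two-sample t-test comparing the two groups
t.test(winpercent ~ health, data = candy_data)

# Modelling: linear regression predicting rating from healthiness and price
model <- lm(winpercent ~ health + pricepercent, data = candy_data)
summary(model)
```

Comparing output like this with the Julius AI results is one way to check that the tool did what you asked.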

Review the results and code, and make corrections if needed. Then download the cleaned data, charts, and report (“Generate Analysis Report” at the top right).

Tips

  • Be specific: Mention variables and desired analysis.
  • Use natural language: No need to know syntax.
  • Ask for explanations: “Explain what this regression output means.”
  • Again, use LLMs but don’t trust them!

Task 5: Practicing prompting

Open ChatGPT / Copilot and write prompts for either task:

  • 5a. You want R code to create a scatterplot of the candy data. The plot should have a spooky theme, with colours and fonts appropriate for Halloween (and, e.g., an image of a haunted house as the background).
  • 5b. Beyond coding, ChatGPT / Copilot can also be used to generate stimulus materials. For example, you are a language psychologist and want to test how quickly participants distinguish real words (e.g. “chat”) from fake words (e.g. “ctah”). You believe that people are faster for words that have more than one meaning (e.g. “bank”). Create lists of words that can be used for such an experiment.