RAINCLOUD PLOT ANALYSIS

G CHAITANYA and HARSHA MV

2026-04-14

Problem Statement: Create a raincloud plot (combining boxplot and density plot) using total_bill as the numerical variable and day as the categorical variable to analyze the distribution of total bill across different days.

What We Are Trying to Do

We want to understand how the total bill amount varies across different days of the week in a restaurant dataset. Instead of just using a plain boxplot, we are using a raincloud plot which gives a much better picture of the data.

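A raincloud plot, in the form used here, combines two things: a density curve that shows the shape of the data, and a boxplot that gives the summary statistics. Together they tell a more complete story about the distribution. Many raincloud plots also show the raw data points as the "rain"; following the problem statement, this analysis uses only the density and the boxplot.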

The two main variables we are working with are total_bill, which is numeric, and day, which is categorical with four values stored as the codes Thur, Fri, Sat and Sun (Thursday, Friday, Saturday and Sunday).

Step 1: Loading the Libraries

library(ggplot2)
library(ggdist)

We are loading two libraries here. ggplot2 is the main plotting library in R that lets us build visualizations layer by layer. ggdist is an extension of ggplot2 that gives us access to special distribution plots like the half-eye density which we need for the raincloud effect.

stat_halfeye(), which draws the density part of the raincloud plot, is provided by ggdist, so both libraries are needed together.

Step 2: Loading the Dataset

data <- read.csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

We are reading the tips dataset directly from a URL using read.csv(). This is a well-known dataset that contains restaurant billing information collected over different days and times.

The dataset has 244 rows. The columns we care about are total_bill, which stores the bill amount in dollars, and day, which records the day of the meal as Thur, Fri, Sat or Sun.
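Since read.csv() imports day as plain text, ggplot2 would place the days in alphabetical order on the axis (Fri, Sat, Sun, Thur). A minimal sketch of one way to restore calendar order follows. It uses a small made-up vector, day_raw, as a stand-in for data$day, and the variable names are illustrative only.

```r
# Hypothetical stand-in for data$day; the real column holds the same four codes.
day_raw <- c("Sun", "Sat", "Thur", "Fri", "Sat", "Sun")

# Convert to a factor with explicit levels so plots and tables follow calendar order.
day_ordered <- factor(day_raw, levels = c("Thur", "Fri", "Sat", "Sun"))

levels(day_ordered)
```

Applied to the real data, this would read data$day <- factor(data$day, levels = c("Thur", "Fri", "Sat", "Sun")), run before the plot is built.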

Step 3: Exploring the Dataset

dim(data)
## [1] 244   7
names(data)
## [1] "total_bill" "tip"        "sex"        "smoker"     "day"       
## [6] "time"       "size"
summary(data[, c("total_bill", "day")])
##    total_bill        day           
##  Min.   : 3.07   Length:244        
##  1st Qu.:13.35   Class :character  
##  Median :17.80   Mode  :character  
##  Mean   :19.79                     
##  3rd Qu.:24.13                     
##  Max.   :50.81

Before we jump into plotting, it is always a good idea to look at the data first. dim() tells us the size of the dataset, names() shows all the column names, and summary() gives us basic statistics like the minimum, maximum and median values.

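From the summary we can see that the total bill ranges from about 3 dollars to about 51 dollars, with a median of 17.80. Because day is stored as text, summary() reports only its length and class, so it does not show how many meals fall on each day. Running table(data$day) gives those counts: Saturday has the most entries (87 of the 244 rows) and Friday the fewest (19).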
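Row counts per day can be obtained with table(). The sketch below shows the idea on a small made-up data frame; toy and its values are hypothetical and are not the tips data.

```r
# Hypothetical miniature of the tips data: same column names, invented values.
toy <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77),
  day        = c("Sun", "Sun", "Sat", "Sat", "Sat", "Thur", "Fri")
)

# Count rows per day, then sort so the day with the most rows comes first.
day_counts <- sort(table(toy$day), decreasing = TRUE)
day_counts
```

On the real data frame the same call is sort(table(data$day), decreasing = TRUE).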

Step 4: Setting Up the Base Plot

p <- ggplot(data, aes(x = day, y = total_bill))
p

Here we are just setting up the base canvas for our plot. We tell ggplot that day goes on the x-axis and total_bill goes on the y-axis. Nothing appears on the plot yet except the axes because we have not added any geometric layers.

This is how ggplot2 works — you build the plot one layer at a time, starting from this base and adding components on top.

Step 5: Adding the Density Layer

p <- p +
  stat_halfeye(
    adjust = 0.5,
    width = 0.6,
    justification = -0.3,
    fill = "skyblue",
    alpha = 0.6
  )
p

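stat_halfeye() from the ggdist package draws the density curve, which is the cloud part of the raincloud plot. The adjust parameter multiplies the bandwidth of the density estimate: values below 1, such as the 0.5 used here, give a less smoothed curve that follows the data more closely. The negative justification value shifts the density to the right of the category position so it does not sit directly on top of the boxplot we will add next. By default stat_halfeye() also draws a median point and an interval line at the category position.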

Looking at this, we can already see that Saturday and Sunday have a wider spread compared to Friday, meaning bills on weekends vary more.
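The effect of adjust can be checked outside the plot. The sketch below uses base R's density() as a stand-in, on simulated bills. ggdist computes its own density and may choose a different default bandwidth, so only the scaling behaviour is expected to carry over, not the exact numbers.

```r
set.seed(1)
# Simulated right-skewed values standing in for total_bill (not the real data).
bills <- rgamma(200, shape = 4, rate = 0.2)

d_default <- density(bills)                # adjust = 1
d_half    <- density(bills, adjust = 0.5)  # half the default bandwidth

# A smaller bandwidth follows the data more closely and gives a bumpier curve.
c(default = d_default$bw, half = d_half$bw)
```

Calling plot(d_default) and then lines(d_half, lty = 2) overlays the two curves for a visual comparison.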

Step 6: Adding the Boxplot Layer

p <- p +
  geom_boxplot(
    width = 0.2,
    outlier.color = "red",
    outlier.shape = 16,
    fill = "lightgreen",
    alpha = 0.7
  )
p

Now we add the boxplot on top of the density layer. The box shows the interquartile range (IQR), the middle line is the median, and each whisker extends to the most extreme value that lies within 1.5 times the IQR of the box. Points beyond the whiskers are drawn individually as outliers, shown here as red dots; in this data they are bills that are unusually high compared to the rest.

We can see a few red outliers, especially on Saturday and Sunday, which means some tables had very large bills on those days. The shorter box on Friday suggests that spending was more consistent that day.
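The rule geom_boxplot() uses to flag outliers can be reproduced by hand: a point is drawn separately when it lies more than 1.5 times the interquartile range beyond the edge of the box. A minimal sketch on an invented vector (bills is hypothetical):

```r
# Hypothetical bill amounts with one very large value.
bills <- c(10, 12, 13, 15, 16, 18, 20, 48)

q   <- quantile(bills, c(0.25, 0.75))  # lower and upper edges of the box
iqr <- unname(q[2] - q[1])             # height of the box

lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[2]) + 1.5 * iqr

# Values outside the fences are what the boxplot draws as separate points.
outliers <- bills[bills < lower_fence | bills > upper_fence]
outliers
```

Running the same lines on the total_bill values of a single day, for example data$total_bill[data$day == "Sat"], should list the points shown in red for that day.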

Step 7: Adding Labels

p <- p + labs(
  title    = "Total Bill Distribution by Day",
  subtitle = "Raincloud Plot (Boxplot + Density)",
  x        = "Day",
  y        = "Total Bill ($)"
)
p

Labels make a plot readable and presentation ready. We add a title that describes what the plot is showing, a subtitle that mentions the technique used, and proper axis labels so anyone looking at it understands what the numbers mean.

A plot without labels is hard to interpret, especially when you are sharing it in a report or presentation.

Step 8: Applying the Theme

p <- p + theme_minimal()
p

theme_minimal() removes the grey panel background and keeps only light grey gridlines, giving the plot a clean and simple look. It puts the focus on the data rather than on decorative elements.

This is the final version of the plot. For each day, the sky blue density curve sits to the right of the green boxplot, and together they form the raincloud shape. The red outlier points stand out clearly against the clean white background.
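For reference, the steps can be collected into a single chain. The sketch below is one possible arrangement, not the only one. It runs on simulated data: sim is a made-up stand-in so the code works offline, and it should be replaced with the data frame from Step 2. It also makes two optional changes that are our assumptions rather than part of the original steps: day is a factor in calendar order, and .width = 0 with point_colour = NA hides the median point and interval line that stat_halfeye() draws by default, since the boxplot already shows that summary.

```r
library(ggplot2)
library(ggdist)

set.seed(42)
# Simulated stand-in for the tips data; the values are invented.
sim <- data.frame(
  day = factor(rep(c("Thur", "Fri", "Sat", "Sun"), each = 40),
               levels = c("Thur", "Fri", "Sat", "Sun")),
  total_bill = round(rgamma(160, shape = 5, rate = 0.25), 2)
)

raincloud <- ggplot(sim, aes(x = day, y = total_bill)) +
  # Cloud: half density, pushed to the right of each category position.
  stat_halfeye(adjust = 0.5, width = 0.6, justification = -0.3,
               .width = 0, point_colour = NA,
               fill = "skyblue", alpha = 0.6) +
  # Summary: slim boxplot at the category position.
  geom_boxplot(width = 0.2, outlier.colour = "red", outlier.shape = 16,
               fill = "lightgreen", alpha = 0.7) +
  labs(title = "Total Bill Distribution by Day",
       subtitle = "Raincloud Plot (Boxplot + Density)",
       x = "Day", y = "Total Bill ($)") +
  theme_minimal()

raincloud
```

Building the plot in one expression gives the same result as adding the layers to p one at a time; the step-by-step form is easier to follow, while the chained form is easier to reuse.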

Key Takeaways

Looking at the final plot, a few things stand out clearly.

Saturday and Sunday have higher median bills and a wider distribution, which is consistent with weekends drawing bigger groups and longer meals, although the plot alone cannot confirm the reason. Friday appears to have the most compact distribution, meaning Friday bills vary less from table to table. This should be read with caution because Friday has only 19 observations, far fewer than any other day.

The outliers on Saturday suggest there were a few tables with very large bills, possibly big group bookings. The density curves also show a slight right skew across all days, which means while most bills are moderate, a small number of customers spend significantly more.
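The statements about centre and spread can be checked numerically rather than read off the plot. A minimal sketch, using an invented data frame (toy and its values are hypothetical) and base R's aggregate():

```r
# Hypothetical miniature of the tips data; the values are invented.
toy <- data.frame(
  day        = c("Fri", "Fri", "Fri", "Sat", "Sat", "Sat"),
  total_bill = c(12, 15, 18, 10, 20, 45)
)

# One row per day with the number of bills, the median and the IQR.
by_day <- aggregate(
  total_bill ~ day, data = toy,
  FUN = function(x) c(n = length(x), median = median(x), iqr = IQR(x))
)
by_day
```

With the real data, replace toy with data. A larger IQR means a taller box in the plot, and a day with few rows gives less stable estimates of both the median and the IQR.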

Overall the raincloud plot gave us much more insight than a plain boxplot would have, because we could see both the shape of the data and the summary statistics at the same time.