G CHAITANYA and HARSHA MV
2026-04-14
Problem Statement: Create a raincloud plot (combining boxplot and density plot) using total_bill as the numerical variable and day as the categorical variable to analyze the distribution of total bill across different days.
We want to understand how the total bill amount varies across different days of the week in a restaurant dataset. Instead of just using a plain boxplot, we are using a raincloud plot which gives a much better picture of the data.
A raincloud plot combines two things — a density curve that shows the shape of the data, and a boxplot that gives the summary statistics. Together they tell a more complete story about the distribution.
The two main variables we are working with are total_bill, which is numeric, and day, which is categorical with four values: Thursday, Friday, Saturday and Sunday (abbreviated in the usual copies of this dataset as Thur, Fri, Sat and Sun).
We are loading two libraries here. ggplot2 is the main
plotting library in R that lets us build visualizations layer by layer.
ggdist is an extension of ggplot2 that gives us access to
special distribution plots like the half-eye density which we need for
the raincloud effect.
ggplot2 on its own has no half-eye geom, so ggdist is needed for the
density part of the raincloud plot as we build it here, and both
libraries are loaded together.
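The loading step described above can be sketched as follows. The commented install line is an addition for readers who do not have the packages yet.

```r
# One-time setup if the packages are missing (left commented on purpose):
# install.packages(c("ggplot2", "ggdist"))

library(ggplot2)  # layered grammar-of-graphics plotting
library(ggdist)   # distribution layers such as stat_halfeye()
```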
We are reading the tips dataset directly from a URL using
read.csv(). This is a well known dataset that contains
restaurant billing information collected over different days and
times.
The dataset has 244 rows. The column we care about most is
total_bill which stores the bill amount in dollars, and
day which tells us which day the meal happened — Thursday,
Friday, Saturday or Sunday.
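A minimal sketch of the reading step is given below. The URL is an assumption: it points at a widely used public copy of the tips data (the seaborn-data repository), which may not be the exact source used for this report. The data frame name tips is also chosen here for illustration.

```r
# Assumed location of the data; replace with the URL actually used.
tips_url <- "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"

# read.csv() accepts a URL as well as a local file path.
tips <- read.csv(tips_url)

# Peek at the two columns used in the plot.
head(tips[, c("total_bill", "day")])
```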
## [1] 244   7
## [1] "total_bill" "tip"        "sex"        "smoker"     "day"
## [6] "time"       "size"
##    total_bill        day
##  Min.   : 3.07   Length:244
##  1st Qu.:13.35   Class :character
##  Median :17.80   Mode  :character
##  Mean   :19.79
##  3rd Qu.:24.13
##  Max.   :50.81
Before we jump into plotting, it is always a good idea to look at the
data first. dim() tells us the size of the dataset,
names() shows all the column names, and
summary() gives us basic statistics like the minimum,
maximum and median values.
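From the summary we can already see that the total bill ranges from 3.07 to 50.81 dollars, and that the mean (19.79) is above the median (17.80), which hints at a right skew. Because day is stored as character, summary() does not count the entries per day; a frequency table of day is needed to confirm that Saturday has the most entries in the dataset.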
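The inspection calls could look like the sketch below. The data frame name tips and the URL are assumptions, and the restriction of summary() to two columns is inferred from the output shown above rather than taken from the original code. The table() call is an addition that counts the entries per day.

```r
# Assumed public copy of the tips data.
tips <- read.csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

dim(tips)                                # rows and columns
names(tips)                              # column names
summary(tips[, c("total_bill", "day")])  # basic statistics for the two variables

# summary() does not count character values, so tabulate day separately.
day_counts <- table(tips$day)
day_counts
```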
Here we are just setting up the base canvas for our plot. We tell
ggplot that day goes on the x-axis and
total_bill goes on the y-axis. Nothing appears on the plot
yet except the axes because we have not added any geometric layers.
This is how ggplot2 works — you build the plot one layer at a time, starting from this base and adding components on top.
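A sketch of the base canvas is shown below, assuming the data frame is called tips and comes from the assumed public URL. The factor() line is an optional addition that is not described in the text: without it, character values of day are placed on the x-axis in alphabetical order rather than calendar order.

```r
library(ggplot2)

# Assumed public copy of the tips data.
tips <- read.csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

# Optional (an addition): put the days in calendar order on the x-axis.
tips$day <- factor(tips$day, levels = c("Thur", "Fri", "Sat", "Sun"))

# Base canvas: only the aesthetic mappings, no geometric layers yet.
p <- ggplot(tips, aes(x = day, y = total_bill))
p
```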
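p <- p +
  stat_halfeye(
    adjust = 0.5,          # bandwidth multiplier for the density estimate
    width = 0.6,           # how much horizontal space the density may use
    justification = -0.3,  # shift the density away from the category position
    fill = "skyblue",
    alpha = 0.6
  )
p
stat_halfeye() from the ggdist package draws the density curve, which is the cloud part of the raincloud plot. The adjust parameter scales the smoothing bandwidth: values below 1, such as the 0.5 used here, give a less smooth curve that follows the data more closely. The justification value shifts the density to the right side so it does not sit directly on top of the boxplot we will add next.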
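To see what a bandwidth multiplier does in isolation, the sketch below uses base R's density(), which also has an adjust argument. This is an illustration by analogy only: ggdist may pick its default bandwidth differently from density(), so the curves need not match those drawn by stat_halfeye(). The URL and the name tips are assumptions.

```r
# Assumed public copy of the tips data.
tips <- read.csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
bills <- tips$total_bill

d_default <- density(bills)                # adjust = 1 (default bandwidth)
d_half    <- density(bills, adjust = 0.5)  # half the default bandwidth

# The bandwidth actually used is the default multiplied by adjust.
c(default = d_default$bw, half = d_half$bw)

# Overlay both estimates: the dashed curve is the less smooth one.
plot(d_default, main = "Effect of adjust on a density estimate")
lines(d_half, lty = 2)
legend("topright", legend = c("adjust = 1", "adjust = 0.5"), lty = c(1, 2))
```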
Looking at this, we can already see that Saturday and Sunday have a wider spread compared to Friday, meaning bills on weekends vary more.
p <- p +
  geom_boxplot(
    width = 0.2,            # narrow box so it does not hide the density
    outlier.color = "red",  # highlight outliers
    outlier.shape = 16,     # solid circle
    fill = "lightgreen",
    alpha = 0.7
  )
p
Now we add the boxplot next to the density. The box shows the interquartile range (IQR), the middle line is the median, and each whisker extends to the most extreme value within 1.5 times the IQR of the box. The red dots are outliers: bills that lie beyond the whiskers and are unusually high compared to the rest.
We can see a few red outliers especially on Saturday and Sunday which means some tables had very large bills on those days. The narrow boxplot on Friday tells us spending was more consistent that day.
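The outliers can also be checked numerically. The sketch below applies the 1.5 times IQR rule per day; find_outliers is a hypothetical helper written for this illustration, and the URL and the name tips are assumptions.

```r
# Assumed public copy of the tips data.
tips <- read.csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

# Hypothetical helper: values more than coef * IQR beyond the quartiles.
find_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  x[x < q[1] - coef * iqr | x > q[2] + coef * iqr]
}

# Apply the rule separately to the bills of each day.
outliers_by_day <- lapply(split(tips$total_bill, tips$day), find_outliers)

lengths(outliers_by_day)  # number of flagged bills per day
outliers_by_day           # the flagged bill amounts
```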
p <- p + labs(
  title = "Total Bill Distribution by Day",
  subtitle = "Raincloud Plot (Boxplot + Density)",
  x = "Day",
  y = "Total Bill ($)"
)
p
Labels make a plot readable and presentation ready. We add a title that describes what the plot is showing, a subtitle that mentions the technique used, and proper axis labels so anyone looking at it understands what the numbers mean.
A plot without labels is hard to interpret, especially when you are sharing it in a report or presentation.
theme_minimal() removes the grey panel background and the axis
ticks, keeping only light gridlines and giving the plot a clean and
simple look. It puts all the focus on the data rather than on
decorative elements.
This is the final version of the plot. The sky blue density curve and the green boxplot beside it together form the raincloud shape, and the red outlier points stand out clearly against the clean white background.
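For reference, the layers described in this report can be assembled in one chain, as sketched below. The parameter values are the ones used in the steps above; the data frame name tips and the URL are assumptions.

```r
library(ggplot2)
library(ggdist)

# Assumed public copy of the tips data.
tips <- read.csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

p <- ggplot(tips, aes(x = day, y = total_bill)) +
  # Cloud: half-eye density shifted to the side of each day.
  stat_halfeye(adjust = 0.5, width = 0.6, justification = -0.3,
               fill = "skyblue", alpha = 0.6) +
  # Summary: narrow boxplot with red outlier points.
  geom_boxplot(width = 0.2, outlier.color = "red", outlier.shape = 16,
               fill = "lightgreen", alpha = 0.7) +
  labs(
    title = "Total Bill Distribution by Day",
    subtitle = "Raincloud Plot (Boxplot + Density)",
    x = "Day",
    y = "Total Bill ($)"
  ) +
  theme_minimal()
p
```

A common variation, not used here, is to add coord_flip() so the densities sit above the boxplots and the result looks more like a cloud over rain.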
Looking at the final plot, a few things stand out clearly.
Saturday and Sunday have higher median bills and a wider distribution, which would be consistent with bigger groups and longer meals on weekends, although the plot itself does not show group size. Friday has the most compact distribution, meaning bills on Fridays vary less. Friday also has far fewer entries than the other days, so this should be read with some caution.
The outliers on Saturday suggest there were a few tables with very large bills, possibly big group bookings. The density curves also show a slight right skew across all days, which means while most bills are moderate, a small number of customers spend significantly more.
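These visual impressions can be cross-checked with a small per-day summary table. The sketch below is one way to do it in base R; the name tips, the URL and the choice of statistics are assumptions made for illustration.

```r
# Assumed public copy of the tips data.
tips <- read.csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

# Split the bills by day, then compute one row of statistics per day.
by_day <- split(tips$total_bill, tips$day)

day_summary <- data.frame(
  day    = names(by_day),
  n      = vapply(by_day, length, integer(1)),  # number of bills
  median = vapply(by_day, median, numeric(1)),  # centre
  iqr    = vapply(by_day, IQR, numeric(1)),     # spread of the middle half
  max    = vapply(by_day, max, numeric(1)),     # largest bill
  row.names = NULL
)
day_summary
```

A mean noticeably above the median for a given day would also support the right skew seen in the density curves.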
Overall the raincloud plot gave us much more insight than a plain boxplot would have, because we could see both the shape of the data and the summary statistics at the same time.