Raincloud Plot Analysis

Author

G CHAITANYA and HARSHA MV

Problem Statement: Create a raincloud plot (combining boxplot and density plot) using total_bill as the numerical variable and day as the categorical variable to analyze the distribution of total bill across different days.

Step 1: Load the Required Libraries

  • Load the ggplot2 and ggdist libraries.
  • ggplot2: provides the plotting framework and the boxplot layer.
  • ggdist: draws the half-density slab (the "cloud" of the raincloud).
library(ggplot2)
library(ggdist)   # for density (raincloud effect)
Warning: package 'ggdist' was built under R version 4.5.3

Step 2: Load CSV Dataset

  • read.csv() loads the CSV dataset directly from the URL below.

Dataset info:

  • Contains restaurant billing data.

Important columns:

  • total_bill → numeric (bill amount)
  • day → categorical (Thur, Fri, Sat, Sun)
data <- read.csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
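
Because read.csv() reads day as plain text, ggplot2 orders the days alphabetically (Fri, Sat, Sun, Thur) on the x-axis. A minimal sketch of putting them in calendar order is shown below. It uses a small stand-in data frame; the name demo and its values are made up for illustration and are not part of the original analysis.

```r
# Stand-in data with the two columns used in this analysis (illustrative values)
demo <- data.frame(
  total_bill = c(16.99, 27.20, 20.65, 22.49),
  day        = c("Sun", "Thur", "Sat", "Fri")
)

# Character values sort alphabetically, which is the order ggplot2 would use
sort(unique(demo$day))

# Convert to a factor with explicit levels to get calendar order instead
demo$day <- factor(demo$day, levels = c("Thur", "Fri", "Sat", "Sun"))
levels(demo$day)
```

The same factor() call could be applied to data$day before the plot is initialized, if calendar order is preferred.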

Step 3: Explore the Dataset

Before performing any analysis, we should understand the dataset.

We will check:

  • Number of rows and columns
  • Column names
  • Data types
  • Summary statistics
  • First few rows
# Dimensions (rows and columns)
dim(data)
[1] 244   7
# Column names
names(data)
[1] "total_bill" "tip"        "sex"        "smoker"     "day"       
[6] "time"       "size"      
# Structure of the dataset
str(data)
'data.frame':   244 obs. of  7 variables:
 $ total_bill: num  17 10.3 21 23.7 24.6 ...
 $ tip       : num  1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
 $ sex       : chr  "Female" "Male" "Male" "Male" ...
 $ smoker    : chr  "No" "No" "No" "No" ...
 $ day       : chr  "Sun" "Sun" "Sun" "Sun" ...
 $ time      : chr  "Dinner" "Dinner" "Dinner" "Dinner" ...
 $ size      : int  2 3 3 2 4 4 2 4 2 2 ...
# Summary statistics
summary(data)
   total_bill         tip             sex               smoker         
 Min.   : 3.07   Min.   : 1.000   Length:244         Length:244        
 1st Qu.:13.35   1st Qu.: 2.000   Class :character   Class :character  
 Median :17.80   Median : 2.900   Mode  :character   Mode  :character  
 Mean   :19.79   Mean   : 2.998                                        
 3rd Qu.:24.13   3rd Qu.: 3.562                                        
 Max.   :50.81   Max.   :10.000                                        
     day                time                size     
 Length:244         Length:244         Min.   :1.00  
 Class :character   Class :character   1st Qu.:2.00  
 Mode  :character   Mode  :character   Median :2.00  
                                       Mean   :2.57  
                                       3rd Qu.:3.00  
                                       Max.   :6.00  
# First six rows
head(data)
  total_bill  tip    sex smoker day   time size
1      16.99 1.01 Female     No Sun Dinner    2
2      10.34 1.66   Male     No Sun Dinner    3
3      21.01 3.50   Male     No Sun Dinner    3
4      23.68 3.31   Male     No Sun Dinner    2
5      24.59 3.61 Female     No Sun Dinner    4
6      25.29 4.71   Male     No Sun Dinner    4
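
Since the plot compares total_bill across days, it can also help to look at per-day counts and centers before plotting. The following is a rough sketch using base R only; demo is a hypothetical stand-in with illustrative values, and in the actual analysis data from Step 2 would take its place.

```r
# Stand-in data (illustrative values); use `data` from Step 2 in practice
demo <- data.frame(
  total_bill = c(16.99, 10.34, 27.20, 22.76, 20.65, 17.92, 28.97, 22.49),
  day        = c("Sun", "Sun", "Thur", "Thur", "Sat", "Sat", "Fri", "Fri")
)

# Number of bills recorded on each day
counts <- table(demo$day)

# Median and mean total bill for each day
medians <- tapply(demo$total_bill, demo$day, median)
means   <- tapply(demo$total_bill, demo$day, mean)

counts
medians
means
```

Days with few observations will have less reliable density curves, so the counts are worth checking alongside the plot.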

Step 4: Initialize Plot

p <- ggplot(data, aes(x = day, y = total_bill))
p

  • Initializes the plot object p with the following mappings:
  • x = day (categorical variable)
  • y = total_bill (numeric variable)
  • No data is drawn yet: printing p shows only an empty panel with axes, because no layers have been added.
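
One way to confirm that the initial object carries no layers is to inspect it directly. This is a small sketch under the assumption that the layers element of a ggplot object is accessible with $, as it is in current ggplot2 releases; demo and p_demo are hypothetical names with illustrative values.

```r
library(ggplot2)

# Stand-in data (illustrative values) with the two mapped columns
demo <- data.frame(
  total_bill = c(16.99, 27.20, 20.65, 22.49),
  day        = c("Sun", "Thur", "Sat", "Fri")
)

# Mappings only: no geoms or stats have been added yet
p_demo <- ggplot(demo, aes(x = day, y = total_bill))
n_before <- length(p_demo$layers)

# Adding a geom appends one layer to the plot object
p_demo <- p_demo + geom_boxplot()
n_after <- length(p_demo$layers)

c(before = n_before, after = n_after)
```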

Step 5: Add Raincloud Components

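What is a raincloud plot? It is a combination of a density plot and a boxplot. In simple words, Raincloud Plot = Density + Boxplot. Many raincloud plots also show the raw data points as the "rain"; this version uses only the density and the boxplot.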

p <- p +
  stat_halfeye(
    adjust = 0.5,
    width = 0.6,
    justification = -0.3,
    fill = "skyblue",
    alpha = 0.6
  ) +
  geom_boxplot(
    width = 0.2,
    outlier.color = "red",
    outlier.shape = 16,
    fill = "lightgreen",
    alpha = 0.7
  )
p

  • In this step, we combine a density plot and a boxplot to form a raincloud plot.
  • The function stat_halfeye() creates the density part of the plot.
  • It shows the distribution (spread) of total bill values for each day.
  • The adjust parameter scales the smoothing bandwidth: values below 1 (here 0.5) give a less smooth, more detailed curve, and values above 1 give a smoother one.
  • The justification parameter shifts the density to the side so that it does not overlap the boxplot.
  • The function geom_boxplot() creates the boxplot.
  • It displays important statistical values such as:
      • Median (middle line)
      • Interquartile range (the box)
      • Whiskers (the most extreme values within 1.5 × IQR of the box)
      • Outliers (points beyond the whiskers)
  • The outliers are highlighted as solid red circles (outlier.color = "red", outlier.shape = 16).
  • The alpha parameter controls the transparency of each layer.
  • This step improves the visualization by combining:
      • Shape of the distribution (density)
      • Statistical summary (boxplot)
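
The numbers behind the two layers can be sketched in base R. This is only an approximation of what the plot computes: boxplot.stats() uses hinges, which can differ slightly from the quartiles used by geom_boxplot(), and density() is a stand-in for the estimator inside stat_halfeye(). The vector bills holds made-up values chosen for illustration.

```r
# Illustrative bill amounts; 48.27 is deliberately far from the rest
bills <- c(10.34, 12.54, 14.83, 16.99, 17.92, 20.65, 21.01, 23.68, 24.59, 48.27)

# Boxplot summary: lower whisker, lower hinge, median, upper hinge, upper whisker
bs <- boxplot.stats(bills)   # default coef = 1.5, the usual 1.5 x IQR rule
bs$stats

# Points beyond the whiskers, which a boxplot draws as outliers
bs$out

# adjust multiplies the default bandwidth: a smaller value gives a bumpier curve
d_default <- density(bills)
d_half    <- density(bills, adjust = 0.5)
d_half$bw / d_default$bw
```

The last line shows that adjust = 0.5 halves the bandwidth, which is why the density curve in the plot above follows the data more closely than the default would.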

Step 6: Add Labels

  • Adding descriptive labels to the plot.
  • Makes plot readable and presentation-ready.
  • Important in reports and dashboards.
p <- p + labs(
  title = "Total Bill Distribution by Day",
  subtitle = "Raincloud Plot (Boxplot + Density)",
  x = "Day",
  y = "Total Bill"
)
p

Step 7: Add Minimal Theme

  • Applying a minimal theme (clean look).
  • Removes unnecessary background elements.
  • Focuses attention on data.
p <- p + theme_minimal()
p