Problem Statement: Create a raincloud plot (combining boxplot and density plot) using total_bill as the numerical variable and day as the categorical variable to analyze the distribution of total bill across different days.
This analysis uses the ggplot2 and ggdist libraries.
- ggplot2: needed to create box plots and other visualizations.
- ggdist: provides the density layer (the raincloud effect).

library(ggplot2)
library(ggdist) # for density (raincloud effect)
Warning: package 'ggdist' was built under R version 4.5.3
data <- read.csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

Before performing any analysis, we should understand the dataset.
We will check its dimensions, column names, structure, summary statistics, and first six rows:
# Dimensions (rows and columns)
dim(data)
[1] 244   7
# Column names
names(data)
[1] "total_bill" "tip"        "sex"        "smoker"     "day"
[6] "time"       "size"
# Structure of dataset
str(data)
'data.frame':	244 obs. of  7 variables:
 $ total_bill: num  17 10.3 21 23.7 24.6 ...
 $ tip       : num  1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
 $ sex       : chr  "Female" "Male" "Male" "Male" ...
 $ smoker    : chr  "No" "No" "No" "No" ...
 $ day       : chr  "Sun" "Sun" "Sun" "Sun" ...
 $ time      : chr  "Dinner" "Dinner" "Dinner" "Dinner" ...
 $ size      : int  2 3 3 2 4 4 2 4 2 2 ...
# Summary statistics
summary(data)
   total_bill         tip             sex               smoker
 Min.   : 3.07   Min.   : 1.000   Length:244         Length:244
 1st Qu.:13.35   1st Qu.: 2.000   Class :character   Class :character
 Median :17.80   Median : 2.900   Mode  :character   Mode  :character
 Mean   :19.79   Mean   : 2.998
 3rd Qu.:24.13   3rd Qu.: 3.562
 Max.   :50.81   Max.   :10.000
     day                time                size
 Length:244         Length:244         Min.   :1.00
 Class :character   Class :character   1st Qu.:2.00
 Mode  :character   Mode  :character   Median :2.00
                                       Mean   :2.57
                                       3rd Qu.:3.00
                                       Max.   :6.00
# First six rows
head(data)
  total_bill  tip    sex smoker day   time size
1      16.99 1.01 Female     No Sun Dinner    2
2      10.34 1.66   Male     No Sun Dinner    3
3      21.01 3.50   Male     No Sun Dinner    3
4      23.68 3.31   Male     No Sun Dinner    2
5      24.59 3.61 Female     No Sun Dinner    4
6      25.29 4.71   Male     No Sun Dinner    4
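The structure output above shows that day is stored as a character column, so a plot would place the days in alphabetical order on the axis. A minimal sketch of one way to impose a weekday order is given here. It runs on a small made-up data frame named `toy` (a hypothetical stand-in, not the real data), and it assumes the day values are "Thur", "Fri", "Sat" and "Sun".

```r
# Toy stand-in for the tips data (values are made up for illustration)
toy <- data.frame(
  day = c("Sun", "Sat", "Thur", "Fri", "Sat", "Sun"),
  total_bill = c(12.5, 20.0, 18.0, 22.5, 30.0, 15.5)
)

# Character values sort alphabetically: Fri, Sat, Sun, Thur
sort(unique(toy$day))

# Convert to a factor with an explicit weekday order
# (assumption: these four labels are the only values of day)
toy$day <- factor(toy$day, levels = c("Thur", "Fri", "Sat", "Sun"))
levels(toy$day)

# Per-day median bill, now reported in weekday order
med <- tapply(toy$total_bill, toy$day, median)
med
```

Applying the same `factor()` call to `data$day` before building the plot should order the groups by weekday rather than alphabetically.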
p <- ggplot(data, aes(x = day, y = total_bill))
p

- day (categorical variable)
- total_bill (numeric variable)

### What is a Raincloud Plot?
- It is the combination of a density plot and a boxplot. In simple words, Raincloud Plot = Density + Boxplot.
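To make the two ingredients concrete, the sketch below computes them numerically with base R on a small made-up vector named `bills` (hypothetical values, not taken from the dataset). Note that `boxplot.stats()` uses Tukey hinges, whereas geom_boxplot() computes quartiles with `quantile()`, so the box edges can differ slightly between the two.

```r
# Illustrative bill amounts (made up for this sketch)
bills <- c(3, 10, 12, 14, 15, 17, 18, 20, 22, 48)

# Boxplot half: five-number summary plus outliers (default coef = 1.5)
bs <- boxplot.stats(bills)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # points beyond the whiskers, drawn as outliers

# Density half: kernel density estimate of the same values
d <- density(bills)
peak <- d$x[which.max(d$y)]  # location where the estimated density is highest
peak
```

The boxplot reduces the data to a handful of numbers, while the density estimate keeps the overall shape; the raincloud plot shows both side by side.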
p <- p +
  stat_halfeye(
    adjust = 0.5,
    width = 0.6,
    justification = -0.3,
    fill = "skyblue",
    alpha = 0.6
  ) +
  geom_boxplot(
    width = 0.2,
    outlier.color = "red",
    outlier.shape = 16,
    fill = "lightgreen",
    alpha = 0.7
  )
p

- stat_halfeye() is used to create the density part of the plot.
- The adjust parameter controls how smooth the density curve looks. Values below 1, such as the 0.5 used here, give a less smooth curve that follows the data more closely.
- The justification parameter shifts the density slightly to the side so it does not overlap with the boxplot.
- geom_boxplot() is used to create the boxplot.

The boxplot shows:
- Median (middle line)
- Interquartile range (the box)
- Whiskers (range of the data, excluding outliers)
- Outliers (extreme values)

Together, the raincloud plot shows:
- Shape of the distribution (density)
- Statistical summary (boxplot)

p <- p + labs(
  title = "Total Bill Distribution by Day",
  subtitle = "Raincloud Plot (Boxplot + Density)",
  x = "Day",
  y = "Total Bill"
)
p

p <- p + theme_minimal()
p
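The effect of the adjust value passed to stat_halfeye() can be illustrated with base R's `density()`, whose `adjust` argument multiplies the kernel bandwidth. This is a sketch under the assumption that ggdist's default density estimator interprets adjust in the same multiplicative way; the data are simulated and purely illustrative.

```r
set.seed(1)
# Simulated right-skewed values standing in for bill amounts (illustrative only)
x <- rgamma(200, shape = 4, rate = 0.2)

d_default <- density(x, adjust = 1)    # default bandwidth
d_narrow  <- density(x, adjust = 0.5)  # half the default bandwidth

d_default$bw
d_narrow$bw

# A smaller bandwidth follows the data more closely, giving a bumpier curve
plot(d_default, main = "Effect of adjust on the density estimate")
lines(d_narrow, lty = 2)
```

Halving the bandwidth is why the density shapes in the raincloud plot look more detailed than a default density curve would.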