: Explore dataframe diamonds and create 5
visualizations
Diamond_ring
The dataset diamonds is a built-in dataset in
ggplot2 library which containing the prices and other
attributes of almost 54,000 round cut diamonds.
The variable in diamonds(53,940 rows x 10
columns) are as follows:
Objective Let’s say I’m looking for an engagement ring (for myself). I have done some research in these diamond attributes and come across some questions:
price in round cut diamonds
?table should be fairly large.
Therefore, I want to compare the tablevalues between
cut gradeclarity, let’s explore
this variable by comparing each clarity groupColor and carat also play a major role in
diamond price. What is the correlation between these two
variables with respect to clarity?color and clarity?# Import Library
library(tidyverse)
library(dplyr)
library(patchwork)
library(scales)
# loading dataset
data(diamonds)
Review dataset
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Review data structure
glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
To gain a better understanding in this dataset, here’s some study that I did on diamond grading system and its value (in this case – price). The grading system is based on four attributes: carat, color, clarity, and cut, known as The Four Cs of Diamonds.
Carat
Diamond weight is measured in carats. One carat (or 1 ct, when abbreviated) equals 0.2 gram. The carat weight is sometimes directly related to price and the prices go up as diamond carat weight increase.
Color
Diamonds are graded on a color scale from D-Z. According to GIA color grading scale, a diamond graded D-F is considered to be “colorless” and diamond graded G-J are considered to be “near colorless”.
GIA color chart
Clarity
Diamonds clarity are graded on a scale from “Internal Flawless (IF)” to “Included (I3)”. Most diamonds have inclusions (defined as anything that will interfere with the free passage of light) but the less inclusions it has, the more valuable it is.
Clarity grading are as follows:
Diamond Clarity Chart
Cut
Diamond cut quality is what gives a diamond its sparkle. Therefore, Cut is considered to be the most important determining the diamond’s overall beauty. The overall quality of the cut is determined by the anatomy of diamond such as table, total depth percentage, etc. as well as the quality of the polish.
For round brilliant diamonds, cut grading are as follows:
To simply explore the distribution of diamonds price, the histogram along with boxplot is plotted as shown below:
# Using sampling method
set.seed(42)
p1 <- ggplot(diamonds %>% sample_n(5400),
aes(price)) +
geom_histogram(bins=10, fill = "#b0c4b1", alpha = 0.5, color = "#4a5759" ) +
theme_minimal() +
ggtitle("Histogram")
p2 <- ggplot(diamonds %>% sample_n(5400),
aes(price)) +
geom_boxplot(fill = "#b0c4b1", alpha = 0.5, color = "#4a5759",
outlier.colour = "#bc4b51", outlier.shape = 1) +
theme_minimal() +
ggtitle("Boxplot")
patchwork <- p1 / p2
patchwork +
plot_annotation(
title = "The Distribution of Diamonds Prices",
subtitle = 'The "prices" distribution is positive skew with outliers (red marks)',
caption = 'Source: Diamonds from ggplot2'
)
As table is one of the attribute that affect the
cut quality. I want to compare the distribution of
table between cut grade side-by-side. The
boxplot of table is created as shown below:
diamonds %>%
ggplot(aes( cut ,table,fill=cut, col=cut)) +
#geom_point(position = "jitter") +
geom_jitter(alpha=0.1, show.legend = FALSE) +
geom_boxplot(alpha=0.9, fill= NA, color = "#463f3a", show.legend = FALSE)+
theme_minimal() +
scale_color_manual(values = c(
"#ffcdb2", "#ffb4a2", "#e5989b", "#b5838d", "#9e2a2b")) +
labs(title = "Boxplot of Table by Cut Grade",
subtitle = "Table: width of top of diamond relative to widest point",
x = "Cut Grade",
y = "Table",
caption = "Source: Diamonds from ggplot2")
As in this dataset, clarity is scaling from IF to I1,
and the flaw is usually noticeable in SI1 to I1. Therefore, instead of
showing all scaling percentage in pie chart, the clarity is
grouped into 4 segments as follows:
In this case, I can focus more on comparing the clean and unnoticeable grade.
## create new segment level
# data transformation
dia_cl <- diamonds %>%
mutate(segment="")
dia_cl$segment[dia_cl$clarity == "IF"] <- "Internal Flawless (IF)"
dia_cl$segment[dia_cl$clarity %in% c("VVS1","VVS2")] <- "Very, Very Slightly Included (VVS1-2)"
dia_cl$segment[dia_cl$clarity %in% c("VS1","VS2")] <- "Very Slightly Included (VS1-2)"
dia_cl$segment[dia_cl$clarity %in% c("SI1","SI2", "I1")] <- "All Other (SI1 to I1)"
dia_cl$segment <- factor(dia_cl$segment, levels = c("Internal Flawless (IF)", "Very, Very Slightly Included (VVS1-2)","Very Slightly Included (VS1-2)","All Other (SI1 to I1)"))
dia_cl <- dia_cl %>%
group_by(segment) %>%
count() %>%
ungroup() %>%
mutate(pc =`n`/sum(`n`)) %>%
arrange(pc ) %>%
mutate(labels = scales::percent(pc, accuracy = 0.1))
dia_cl
## # A tibble: 4 × 4
## segment n pc labels
## <fct> <int> <dbl> <chr>
## 1 Internal Flawless (IF) 1790 0.0332 3.3%
## 2 Very, Very Slightly Included (VVS1-2) 8721 0.162 16.2%
## 3 Very Slightly Included (VS1-2) 20429 0.379 37.9%
## 4 All Other (SI1 to I1) 23000 0.426 42.6%
Create Pie Chart
# build pie chart
ggplot(dia_cl, aes(x="", y = pc, fill= segment)) +
geom_col(alpha=0.8) +
geom_text(aes(x = 1.65, label = labels),
position = position_stack(vjust = 0.5),
size = 3)+
coord_polar(theta = "y") +
theme_void() + #empty theme
scale_fill_manual(values=c("#8338ec","#e29578","#ccc5b9","#252422")) +
guides(fill = guide_legend(title = "Clarity Segments")) +
labs(title = "Percentage of diamonds by Clarity Segments",
subtitle = "Total n = 53,940",
caption = "Source: Diamonds from ggplot2")
As diamond prices depend on various variables. I want to
explore the correlation between carat and
price in colorless diamonds (D, E, and F), focusing solely
on clean & unnoticeable flaw clarity grades (IF to VS2).
Scatter plots with trendlines are used to illustrate these correlations.
## Data Transformation
colorless_dia <- diamonds %>%
filter(color %in% c("D","E","F") & cut == "Ideal"
& clarity %in% c("IF","VVS1","VVS2","VS1","VS2"))
# summary(colorless_dia)
## Scatterplot
ggplot(colorless_dia, aes(x=carat, y=price, col = color) ) +
geom_point(size=1, position = "jitter", alpha=0.4) +
geom_smooth(method="lm", aes(color=color),
formula = (y ~ exp(x)),
se = FALSE) +
scale_x_continuous(limits =c(0.20,2.00) , n.breaks = 10) +
scale_y_continuous(limits = c(500,20000), n.breaks = 10) +
theme_minimal() +
scale_color_manual( values = c("#4575b4", "#91cf60", "#fc8d59")) +
facet_wrap(~ clarity, ncol = 2) +
labs(title = "Correlation between Carat and Price in Colorless Diamonds",x = "Carat", y = "Price (USD)",
subtitle = "Cut: Ideal grade, Color: D-E-F, Clarity: IF to VS2 ",
caption = "Source: Diamonds from ggplot2")
Assume the engagement ring that I’m looking for is 1.00-1.25 carat
diamond with Ideal-cut. Before making decision, I want to know the
average price of ideal cut diamond based on
color and clarity.
colorless_dia %>%
filter(carat >= 1 & carat <= 1.25) %>%
group_by(color,clarity) %>%
summarise(avg_price = round(mean(price))) %>%
ggplot( aes(x=color, y=avg_price, fill=clarity)) +
geom_bar(alpha = 0.8,stat="identity", position=position_dodge())+
geom_text(aes(label=avg_price), vjust=1.6, color="black",
position = position_dodge(0.9), size=2)+
scale_fill_brewer(palette="BuPu", direction = 1) +
theme_minimal() +
scale_y_continuous(limits = c(0,18000), n.breaks = 8) +
labs(title = "Average Price of Colorless Diamonds by Clarity",
subtitle = "Cut: Ideal grade and Carat: 1.00 to 1.25",
x = "Color",
y = "Average Price (USD)",
caption = "Source: Diamonds from ggplot2")