library(tidyverse)
#to visualize missing values
library(visdat)
library(patchwork)
#to build treemap
library(treemap)
#to create text grobs
library(grid)
#to stop scientific notation on axes
library(scales)
#to build marginal distribution plots
library(ggExtra)This notebook contains a list of most popular data distribution charts used to analyse a dataset in initial phase followed by a detailed analysis.
x- length of diamond in mm.
y- width of diamond in mm.
z- depth of diamond in mm.
depth- This is actually the depth % of the
diamond.
It is calculated as: \(depth\% = z / mean(x +
y)\). To simplify, \(depth\% = 2z /
(x+y)\).
carat- weight of the diamonds. Determined by x, y
and z values.
clarity- contains 8 ordered levels from “SI2”
(worst) to IF (best); (SI2 contains high impurities). It affects the
brilliance of a diamond. It determines how light gets reflected
internally in a diamond, resulting in the glimmer.
cut 5 categories: Fair < Good < Very Good <
Premium < Ideal.
color- Colourless diamonds are considered better
than the ones with yellowish-brownish tint.
- This dataset contains diamonds of 7 colours- “D” to “J”.
- D, E, F are colorless & G-J have a very faint color.
- The default ordering seemed a bit confusing to me as it goes like D
< E < F < … < J.
Please note- D is the best diamonds in terms of color
and J is the worst, w.r.t to this dataset.
table- the width of the top of diamond.
price- price of diamonds in USD.
The key benchmark descriptors of a diamond are the 4 Cs- carat, color, cut and clarity.
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
carat ranges from 0.20 to 5.01. But 75% of diamonds
lie below 1.04 carats. This implies presence of high outliers & data
seems right-skewed.
cut- “very good”, “premium” and “ideal” diamonds
form roughly 88% (~87.91) of the data with diamonds of ideal cut being
highest in number.
color- Diamonds with color rating “G” are highest in
number with “J”, the lowest quality, being least.
clarity- summary(df1) shows a category
as “(others)” above. Use summary(df1$clarity) to see all
the levels clearly. “others” has masked two levels- I1, IF.
depth featue seems symmetric and tightly
distributed. In other words, most of the datapoints lie close to the
mean/median value of 61.75.
In order to avoid confusion, I’ll rename the column “z” to depth & “depth” to “depth_perc” (depth percentage).
z which reflects depth of a diamond has a minimum
value of 0.000; it seems quite unlikely for a diamond to have absolutely
no depth!
Similarly, x and y too have minimum
value of 0.000mm. Maybe it’s the same rows where z = 0.000. This will be
checked later.
I’ve added 2 methods to see missing values here, one is through code result and the other is grahical.
sum(is.na(df)) counts total number of NA’s in entire
dataframe.
## [1] 0
The method below computes missing values in each column.
## carat cut color clarity depth table price x y z
## [1,] 0 0 0 0 0 0 0 0 0 0
Below, vis_miss() of visdata package has been used to
build a graph of missing values. Since, no missing values present, hence
the entire graph is grey, if missing values were present, we would have
seen some horizontal lines inside each column.
vis_miss() is
ggplot-based, so ggplot2 functions can be added as layers to improve
aesthetics.
visdat::vis_miss(df1)+
theme(axis.title.y = element_text(size = 10),
axis.text = element_text(color = "black", size = 8))## [1] 146
## # A tibble: 146 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.79 Ideal G SI1 62.3 57 2898 5.9 5.85 3.66
## 2 0.79 Ideal G SI1 62.3 57 2898 5.9 5.85 3.66
## 3 0.79 Ideal G SI1 62.3 57 2898 5.9 5.85 3.66
## 4 0.79 Ideal G SI1 62.3 57 2898 5.9 5.85 3.66
## 5 1.52 Good E I1 57.3 58 3105 7.53 7.42 4.28
## 6 1 Fair E SI2 67 53 3136 6.19 6.13 4.13
## 7 1 Fair F SI2 65.1 55 3265 6.26 6.23 4.07
## 8 0.9 Very Good I VS2 58.4 62 3334 6.29 6.35 3.69
## 9 1 Ideal E SI2 62.9 56 3450 6.32 6.3 3.97
## 10 1 Fair H SI1 65.5 57 3511 6.26 6.21 4.08
## # ℹ 136 more rows
It is highly unlikely for diamonds to have exactly all 10 features with same values. While price, cut, clarity etc. can be same, even length, width and depth, table have exactly same measurements. So, I’m removing these duplicate rows.
As mentioned earlier, length,
width, depth have a minimum value of
0.000mm which seems unlikely for any diamond.
## length width depth
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.700 Median : 5.710 Median : 3.530
## Mean : 5.731 Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.030
## Max. :10.740 Max. :58.900 Max. :31.800
Depth percentage is calculated as : \(2*z/(x+y)\).
If depth is 0, then depth% should also be 0.
## # A tibble: 7 × 5
## carat length width depth depth_perc
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.07 0 6.62 0 61.6
## 2 1 0 0 0 63.3
## 3 1.14 0 0 0 57.5
## 4 1.56 0 0 0 62.2
## 5 1.2 0 0 0 62.1
## 6 2.25 0 0 0 62.8
## 7 0.71 0 0 0 64.1
All 7 rows have depth 0 but some value in \(depth\%\) column, which is just not possible!
#checking rows with width 0
df1_unique %>%
subset(width < 0.001,
select = c(carat, length, width, depth,depth_perc))## # A tibble: 6 × 5
## carat length width depth depth_perc
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 0 0 63.3
## 2 1.14 0 0 0 57.5
## 3 1.56 0 0 0 62.2
## 4 1.2 0 0 0 62.1
## 5 2.25 0 0 0 62.8
## 6 0.71 0 0 0 64.1
All these rows with 0 length and width also have depth=0.
#see all rows with depth 0
df1_unique %>%
subset(depth < 0.001,
select = c(carat, length, width, depth, depth_perc))## # A tibble: 19 × 5
## carat length width depth depth_perc
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 6.55 6.48 0 59.1
## 2 1.01 6.66 6.6 0 58.1
## 3 1.1 6.5 6.47 0 63
## 4 1.01 6.5 6.47 0 59.2
## 5 1.5 7.15 7.04 0 64
## 6 1.07 0 6.62 0 61.6
## 7 1 0 0 0 63.3
## 8 1.15 6.88 6.83 0 59.2
## 9 1.14 0 0 0 57.5
## 10 2.18 8.49 8.45 0 59.4
## 11 1.56 0 0 0 62.2
## 12 2.25 8.52 8.42 0 61.3
## 13 1.2 0 0 0 62.1
## 14 2.2 8.42 8.37 0 61.2
## 15 2.25 0 0 0 62.8
## 16 2.02 8.02 7.95 0 62.7
## 17 2.8 8.9 8.85 0 63.8
## 18 0.71 0 0 0 64.1
## 19 1.12 6.71 6.67 0 60.4
Since there are 19 rows with depth=0.000, I’ll start by first removing these rows and see how many length=0 or width=0 get eliminated.
#checking number of rows still left with length=0.000 or width=0.000
df1_unique %>%
subset(width < 0.001 | length < 0.001,
select = c(carat, length, width, depth, depth_perc))## # A tibble: 0 × 5
## # ℹ 5 variables: carat <dbl>, length <dbl>, width <dbl>, depth <dbl>,
## # depth_perc <dbl>
Removing rows with depth=0.000 also removes rows with length=0.000, width=0.000.
Following charts have been created to show the range of options
available to visualize data.
The next section “Detailed Analysis” contains an in-depth analysis of
the dataset.
theme_set(
theme_bw()+
theme(axis.text = element_text(color = "black", size = 9, face = "bold"),
axis.title = element_text(size = 8, face = "bold"),
plot.title = element_text(size = 9, hjust = 0.5),
plot.subtitle = element_text(size = 8, hjust = 0.5)
)
)Barcharts- good to see total number of data
points in each category.
Heatmaps- good to see total count of each sub-category for the main category. In the example below, we can visualize number of diamonds of each color for each type of cut.
Treemaps- can be used to visualize multiple categories within a category. In the e.g. below, the main category is cut, sub-category1 is color and sub-category2 is clarity of diamond. The area of tiles reflects the proportion of data points within each category.
#total count of each diamond-type in dataset
df1_unique %>%
ggplot(aes(x=cut))+
geom_bar(color = "steelblue", fill = "lightblue", width = 0.5)+
geom_text(aes(label = after_stat(count)), stat = "count", size = 2,
fontface = "bold", vjust = -0.5)+
labs(title = "Ideal diamonds are highest in number",
x = "type of cut", y = "total count in the dataset")Such maps can be used to view the ‘group’ with high count/value. For
e.g.the map below shows the relationship between diamond cut and
color.
The color intensity in the tiles reflect number of datapoints with the
corresponding cut and color.
Darker tiles reflect high count.
#group data by cut and color
cut_color_aggr <- df1_unique %>%
group_by(cut, color) %>%
summarise(total_count = n())#heatmap using 2 ordinal variables- color, cut
cut_color_aggr %>%
ggplot(aes(x = cut, y = color, fill = total_count))+
geom_tile(color = "black")+
scale_fill_gradient(low = "#FFE900", high = "#E43D00")+
#remove space between X-Y axis and main plot
scale_y_discrete(expand = c(0,0))+
scale_x_discrete(expand = c(0,0))+
labs(title = "Frequency of diamonds grouped by cut & color ratings",
subtitle = "Dark color reflects higher count")#data prep- aggregate of the 3 Cs- cut, clarity and color
aggr_3c <- df1_unique %>%
group_by(cut, color, clarity) %>%
summarise(total_count = n())#treemap of cut, color, clarity
treemap(aggr_3c,
index = c('cut', 'color', 'clarity'),
vSize = 'total_count',
type = "index",
title = "Number of diamonds by cut, color & clarity",
#increse label size if rectangles are bigger
inflate.labels = F,
bg.labels = "yellow",
#size of labels of each group, sub-group, sub-sub-group
fontsize.labels = c(10,8,6),
#border color for group, sub-group, sub-sub-group
border.col = c("black","black","white")
)#simple histogram of carat distribution
carat_hist <- df1_unique %>%
ggplot(aes(x=carat))+
geom_histogram(binwidth = 0.3, fill = "pink",
color = "#d45087",
#boundary sets the X-labels as start-end values of an interval; and not in center of bar
boundary = 0,
closed = "left")+
scale_y_continuous(expand = expansion(mult = c(0,0.02)))+
scale_x_continuous(breaks = seq(0,5.30,0.30),
labels = seq(0,5.3,0.3))+
labs(title = "Simple histogram",
subtitle = "total diamond on Y-axis")+
theme(axis.text = element_text(size = 7))#Relative frequency histogram with % of data-points on Y-axis.
carat_relative_hist <- df1_unique %>%
ggplot(aes(x=carat))+
geom_histogram(aes(y = after_stat(count)/sum(count)),
binwidth = 0.3,
fill = "pink",
color = "#d45087",
boundary = 0,
closed = "left")+
scale_x_continuous(breaks = seq(0,5.3,.3),
labels = seq(0,5.3,.3))+
scale_y_continuous(expand = expansion(mult = c(0,0.02)))+
labs(title = "Relative frequency histogram",
subtitle = "relative % on Y axis",
y = "freq percentage")+
theme(axis.text = element_text(size = 7))#histogram plots stitched
(carat_hist + carat_relative_hist)+
plot_annotation(title = "Carat distribution: Two types of histograms",
theme = theme(
plot.title = element_text(size = 11, face = "bold", hjust = 0.5)))In order to add labels like “mean”, “median” along the mean/median
lines, first we need to create a text graphical object or ‘Grob’ and
then add it to ggplot using
annotation_custom().
annotate() can also be used here but
annotation_custom() provides more control
over the text aesthetics.
#text graphical object for mean/median labels
mean_label <- grid::textGrob(label = "mean",
rot = 90, y = 0.80,
gp = gpar(fontsize = 7,
fontface = "bold",
col = "darkgreen")
)
median_label <- grid::textGrob(label = "median",
rot = 90, y = 0.90,
gp = gpar(fontsize = 7,
#fontface = "bold",
col = "blue")
)
#mean and median of diamond carats
carat_mean <- mean(df1_unique$carat)
carat_median <- median(df1_unique$carat)df1_unique %>%
ggplot(aes(x=carat))+
geom_histogram(binwidth = 0.3,
fill = "pink",
color = "#d45087",
boundary = 0,
closed = "left")+
#mean line
geom_vline(aes(xintercept = mean(carat)), color = "darkgreen", linetype = "dashed")+
#label mean
annotation_custom(mean_label, xmin = carat_mean+0.05, xmax = carat_mean+0.05)+
#median line
geom_vline(xintercept = carat_median, color = "blue",
linetype = "dashed", linewidth = 0.5)+
#label- median
annotation_custom(median_label, xmin = carat_median-0.05,
xmax = carat_median-0.05)+
scale_x_continuous(breaks = seq(0,5.3,.3),
labels = seq(0,5.3,.3))+
scale_y_continuous(expand = expansion(mult = c(0,0.02)))+
labs(title = "Carat Distribution")df1_unique %>%
ggplot(aes(x=carat, fill = cut))+
geom_density(outline.type = "full", alpha = 0.9,
color = "black")+
scale_y_continuous(expand = expansion(mult = c(0,0.02)))+
labs(title = "Density Plot for carats")#frequency polygon for diamond carats- y axis: total count
count_f_poly <- df1_unique %>%
ggplot(aes(x=price, color = cut, fill = cut))+
geom_freqpoly(binwidth=500)+
scale_y_continuous(expand = expansion(mult = c(0,0.02)))+
scale_x_continuous(breaks = seq(0,20000,3000),
labels = seq(0,20000,3000))+
labs(title = "Frequency Polygon for diamonds price",
subtitle = "Y-axis: total count (default)",
x = "price (in USD)")+
theme(legend.position = "none")
#Y-axis is density, not diamonds count
density_f_poly <- df1_unique %>%
ggplot(aes(x=price, color = cut, fill = cut))+
geom_freqpoly(aes(y=after_stat(density)),binwidth=500)+
scale_y_continuous(expand = expansion(mult = c(0,0.02)), labels = scales::comma)+
scale_x_continuous(breaks = seq(0,20000,3000),
labels = seq(0,20000,3000))+
labs(title = "Frequency Polygon for diamonds price",
subtitle = "Y-axis: density",
x = "price (in USD)")df1_unique %>%
ggplot(aes(x=cut, y = carat, fill = cut))+
#add bars at the end of whiskers
stat_boxplot(geom = "errorbar", width = 0.3)+
geom_boxplot(outlier.color = "red", outlier.size = 1)+
labs(title = "Boxplots for carat distribution")This is a combination of boxplot and density plot.
#violing plot is a combination of box and density plot
df1_unique %>%
ggplot(aes(x=cut, y = carat, fill = cut))+
#trim=F prevents trimming of violin ends,
#displays violin for full range of carat
geom_violin(trim = F, draw_quantiles = c(0.25,0.5,0.75))+
labs(title = "Violin plot for carat distribution")Rug plots are not separate plots; they just highlight each data point by creating a line for each data point along the axes.
df1_unique %>%
filter(cut == "Ideal", color == "D",
clarity %in% c("IF","VVS1")) %>%
ggplot(aes(x=carat, y=price,color = clarity))+
geom_point()+
geom_rug()+
scale_color_manual(values = c("IF" = "cyan4", "VVS1" = "darkmagenta"))+
labs(title = "Rug plot + Scatter plot",
subtitle = "Ideal diamonds with color rating 'D'")Marginal plots helps us visualize the distribution intensity at
different values of the variable. ggExtra
package has been used.
#scatterplot with histogram highlighting marginal distribution
scatter_plot <- df1_unique %>%
filter(cut %in% c("Ideal", "Premium")) %>%
ggplot(aes(x=carat, y = price))+
geom_jitter(color = "darkmagenta")+
labs(title = "Marginal Distribution through density plot",
subtitle = "75% of diamonds have price below $5369, hence the histogram is skewed")
ggExtra::ggMarginal(scatter_plot, type = "histogram",
fill = "darkmagenta", margins = 'y')##individual plots
#total count of each diamond-type in dataset
cut_count <- df1_unique %>%
ggplot(aes(x=cut))+
geom_bar(color = "steelblue", fill = "lightblue", width = 0.5)+
geom_text(aes(label = paste0(round((after_stat(count)/sum(count))*100,2),"%")),
stat = "count", size = 2,
fontface = "bold", vjust = -0.5)+
scale_y_continuous(expand = expansion(mult = c(0,0.07)))+
coord_cartesian(clip = "off")+
labs(title = "Ideal diamonds are highest in number",
x = "type of cut", y = "total count in the dataset")
#total count of diamonds of each color type
color_count <- df1_unique %>%
ggplot(aes(x=color))+
geom_bar(color = "steelblue", fill = "lightblue", width = 0.5)+
geom_text(aes(label = paste0(round((after_stat(count)/sum(count))*100,2),"%")),
stat = "count", size = 2,
fontface = "bold", vjust = -0.5)+
scale_y_continuous(expand = expansion(mult = c(0,0.07)))+
coord_cartesian(clip = "off")+
labs(title = "Diamonds of color E,F,G form 56.8% of the dataset",
x = "color of diamond", y = "")
#total count of diamonds of each clarity type
clarity_count <- df1_unique %>%
ggplot(aes(x=clarity))+
geom_bar(color = "steelblue", fill = "lightblue", width = 0.5)+
geom_text(aes(label = paste0(round((after_stat(count)/sum(count))*100,2),"%")),
stat = "count", size = 2,
fontface = "bold", vjust = -0.5)+
scale_y_continuous(expand = expansion(mult = c(0,0.07)))+
coord_cartesian(clip = "off")+
labs(title = "Diamond clarity",
x = "diamond clarity", y = "")Observations:
The dataset contains super-quality diamonds in terms of their cut; ‘Premium’ & ‘Ideal’ diamonds are most abundant.
Diamonds with color rating ‘G’ are highest in number followed by
‘E’. D-F color rating diamonds are colorless, hence of better standards.
G-J rated diamonds have a yellowish tint as we progress from G < H
< I < J.
‘G’ diamonds lie on the margin of these 2 groups and they are most
abundant in the dataset.
“IF”, ‘Internally Flawless’ diamonds are the best in terms of clarity and they have the second lowest count in our dataset.
The 4Cs- carat, cut, color, clarity are considered the benchmarks for assessing diamonds and in upcoming section, we’ll see how each of these factors affect the diamond price individually and together.
cut_color_heatmap <- df1_unique %>%
group_by(cut, color) %>%
summarize(total_count = n()) %>%
ggplot(aes(x=cut, y = color, fill = total_count))+
geom_tile(color = "black")+
scale_x_discrete(expand = c(0,0))+
scale_y_discrete(expand = c(0,0))+
scale_fill_gradient(low = "yellow",
high = "firebrick",
guide = guide_colorbar(title = "diamonds count", title.theme = element_text(size = 7, face = "bold", vjust = 2)))+
labs(title = "Ideal diamonds with color rating 'G' are most abundant")
#cut and clarity
cut_clarity_heatmap <- df1_unique %>%
group_by(cut, clarity) %>%
summarize(total_count = n()) %>%
ggplot(aes(x=cut, y = clarity, fill = total_count))+
geom_tile(color = "black")+
scale_x_discrete(expand = c(0,0))+
scale_y_discrete(expand = c(0,0))+
scale_fill_gradient(
low = "yellow",
high = "firebrick",
guide = guide_colorbar(title = "diamonds count", title.theme = element_text(size = 7, face = "bold", vjust = 2)))+
labs(title = "Ideal diamonds with VS2 clarity are most abundant")
cut_color_heatmap + cut_clarity_heatmapdf1_unique %>%
group_by(color, clarity) %>%
summarize(count = n()) %>%
ggplot(aes(x=color, y=clarity, fill=count))+
geom_tile(color = "black")+
scale_x_discrete(expand = c(0,0))+
scale_y_discrete(expand = c(0,0))+
scale_fill_gradient(low = "yellow", high = "firebrick",
guide = guide_colorbar(title = "diamonds count", title.theme = element_text(size = 8, face = "bold")))+
labs(title = "Majority of diamonds with color rating 'E' have clarity SI1 or VS2")#grouped by cut, clarity and color
df1_unique %>%
group_by(cut, color, clarity) %>%
summarize(total_count = n()) %>%
ggplot(aes(x=cut, y= total_count, fill = cut))+
geom_bar(stat = "identity")+
scale_y_continuous(expand = expansion(mult = c(0,0.05)))+
facet_grid(clarity~color)+
theme(
strip.text.x = element_text(size = 5, margin = margin(t=0.4,b=0.4), face = "bold"),
strip.text.y = element_text(size = 5,face = "bold", margin = margin(r=0.4,l=0.4)),
axis.text = element_text(size = 6),
axis.text.x = element_text(angle = 90)
)+
labs(title = "Diamond frequency based on 3Cs",
y="")Observations
cut_price <- df1_unique %>%
ggplot(aes(x=cut, fill = cut, y = price))+
stat_boxplot(geom = "errorbar", width = 0.3)+
geom_boxplot(color = "black", outlier.colour = "red",
outlier.size = 0.3)+
labs(title = "Price distribution of diamonds based on cut")+
theme(legend.position = "none")
#carat-cut boxplot
cut_carat <- df1_unique %>%
ggplot(aes(x=cut, y = carat, fill = cut))+
stat_boxplot(geom = "errorbar", width = 0.3)+
geom_boxplot(color = "black", outlier.colour = "red", outlier.size = 0.6)+
labs(title = "Carat distribution with respect to cut")+
theme(legend.position = "none")
cut_price + cut_caratObservation:
## [1] "Carat percentile of Ideal diamonds"
## 0% 25% 50% 75% 100%
## 0.20 0.35 0.54 1.01 3.50
#boxplot of color and price of diamond
color_price <- df1_unique %>%
ggplot(aes(x=color, y = price, fill = color))+
stat_boxplot(geom = "errorbar", width = 0.3)+
geom_boxplot(color = "black", outlier.colour = "red",
outlier.size = 0.4)+
labs(title = "Price distribution with respect to diamond color")+
theme(legend.position = "none")
#boxplot of color and carat
color_carat <- df1_unique %>%
ggplot(aes(x=color, y = carat, fill = color))+
stat_boxplot(geom = "errorbar", width = 0.3)+
geom_boxplot(color = "black", outlier.colour = "red",
outlier.size = 0.4)+
labs(title = "Carat distribution with respect to diamond color")+
theme(legend.position = "none")
#stitch both graphs
color_price + color_caratObservations:
## [1] "Carat percentile of D-rated diamonds"
## 0% 25% 50% 75% 90% 99%
## 0.20 0.36 0.53 0.91 1.10 1.75
## [1] "Carat percentile of J-rated diamonds"
## 0% 25% 50% 75% 90% 99%
## 0.23 0.71 1.11 1.52 2.03 2.51
We can observe from these readings that carat of a diamond seems to play a crucial role in price determination even though color, cut is of highest quality.
#boxplot for price of diamone and their clarity
clarity_price <- df1_unique %>%
ggplot(aes(x=clarity, y = price, fill = clarity))+
stat_boxplot(geom = "errorbar", width = 0.3)+
geom_boxplot(color = "black", outlier.colour = "red", outlier.size = 0.4)+
labs(title = "price distribution with respect to clarity")+
theme(legend.position = "none")
#boxplot for clarity and carat
clarity_carat <- df1_unique %>%
ggplot(aes(x=clarity, y = carat, fill = clarity))+
stat_boxplot(geom = "errorbar", width = 0.3)+
geom_boxplot(color = "black", outlier.colour = "red", outlier.size = 0.4)+
labs(title = "carat distribution with respect to clarity")+
theme(legend.position = "none")
#stitch graphs
clarity_price + clarity_caratObservation
Even though ‘IF’ and ‘VVS1’ are diamonds with maximum brilliance, their median price is less than diamonds with low clarity rating!
So, to summarize:
(cut_scatter <- df1_unique %>%
ggplot(aes(x=carat, y=price, color = cut))+
geom_point()+
#scale_y_continuous(expand = expansion(mult = c(0,0.05)))+
labs(title = "As carat increases, prices tend to go up",
y = "price in USD"))The correlation matrix helps to quantify the dependency between all
relevant features.
Since, depth_perc is calculated using length, width & depth values,
these 3 features have been excluded.
#select relevant features
corr_dataset <- df1_unique %>%
select(cut, color, clarity,depth_perc,carat, price)There are 3 columns of type ‘factor’ in our data- color, cut
& clarity.
Color and Cut have levels in ascending order of the quality. They are ordered from worst to best. For e.g. Quality of diamond cut- “Fair” has level 1 < “Good” is level 2, and so on. Similarly, for clarity- “I1” is 1, …,“IF” is 8, which is the best.
This is not the case with colors column. It is ordered as “D” (1) < “E” (2) < … < “J” (7). D is best rating but it has level of 1.
I’ve changed the order from worst to best for color column below, i.e. “J” gets level 1, “D” gets level 7.
corr_dataset$color <- factor(corr_dataset$color,
levels = c("J", "I", "H", "G", "F", "E", "D"), ordered = T)
##checking the order
#levels(corr_dataset$color)[7]# Convert the factor to numeric
#correlation matrix can be created with numeric values only
#this replaces values like "ideal" with their ordering number.
corr_dataset$color <- as.numeric(corr_dataset$color)
corr_dataset$clarity <- as.numeric(corr_dataset$clarity)
corr_dataset$cut <- as.numeric(corr_dataset$cut)## cut color clarity depth_perc carat
## cut 1.00000000 0.02014858 0.18833670 -0.21713549 -0.13335113
## color 0.02014858 1.00000000 -0.02570228 -0.04765400 -0.29094288
## clarity 0.18833670 -0.02570228 1.00000000 -0.06740010 -0.35219384
## depth_perc -0.21713549 -0.04765400 -0.06740010 1.00000000 0.02788878
## carat -0.13335113 -0.29094288 -0.35219384 0.02788878 1.00000000
## price -0.05222510 -0.17174629 -0.14612430 -0.01114383 0.92154847
## price
## cut -0.05222510
## color -0.17174629
## clarity -0.14612430
## depth_perc -0.01114383
## carat 0.92154847
## price 1.00000000