Diamonds data - Descriptive Statistics

library(tidyverse)
#to visualize missing values
library(visdat)
library(patchwork)
#to build treemap
library(treemap)
#to create text grobs
library(grid)
#to stop scientific notation on axes
library(scales)
#to build marginal distribution plots
library(ggExtra)

This notebook contains a list of most popular data distribution charts used to analyse a dataset in initial phase followed by a detailed analysis.

Data description

x- length of diamond in mm.
y- width of diamond in mm.
z- depth of diamond in mm.
depth- This is actually the depth % of the diamond.
It is calculated as: $depth\% = z / mean(x + y)$. To simplify, $depth\% = 2z / (x+y)$.
carat- weight of the diamonds. Determined by x, y and z values.
clarity- contains 8 ordered levels from “SI2” (worst) to IF (best); (SI2 contains high impurities). It affects the brilliance of a diamond. It determines how light gets reflected internally in a diamond, resulting in the glimmer.
cut 5 categories: Fair < Good < Very Good < Premium < Ideal.
color- Colourless diamonds are considered better than the ones with yellowish-brownish tint.
- This dataset contains diamonds of 7 colours- “D” to “J”.
- D, E, F are colorless & G-J have a very faint color.
- The default ordering seemed a bit confusing to me as it goes like D < E < F < … < J.
Please note- D is the best diamonds in terms of color and J is the worst, w.r.t to this dataset.
table- the width of the top of diamond.
price- price of diamonds in USD.

The key benchmark descriptors of a diamond are the 4 Cs- carat, color, cut and clarity.

df1 <- diamonds
#quick view
glimpse(df1)

## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

#check summary
summary(df1)

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
##

Initial Observations from summary stats:

carat ranges from 0.20 to 5.01. But 75% of diamonds lie below 1.04 carats. This implies presence of high outliers & data seems right-skewed.
cut- “very good”, “premium” and “ideal” diamonds form roughly 88% (~87.91) of the data with diamonds of ideal cut being highest in number.
color- Diamonds with color rating “G” are highest in number with “J”, the lowest quality, being least.
clarity- summary(df1) shows a category as “(others)” above. Use summary(df1$clarity) to see all the levels clearly. “others” has masked two levels- I1, IF.
depth featue seems symmetric and tightly distributed. In other words, most of the datapoints lie close to the mean/median value of 61.75.
In order to avoid confusion, I’ll rename the column “z” to depth & “depth” to “depth_perc” (depth percentage).
z which reflects depth of a diamond has a minimum value of 0.000; it seems quite unlikely for a diamond to have absolutely no depth!
Similarly, x and y too have minimum value of 0.000mm. Maybe it’s the same rows where z = 0.000. This will be checked later.

I. Data Cleaning

1. Missing Values- none

I’ve added 2 methods to see missing values here, one is through code result and the other is grahical.

a) Usual method of checking missing values

sum(is.na(df)) counts total number of NA’s in entire dataframe.

#count NAs in entire dataframe
sum(is.na(df1))

## [1] 0

The method below computes missing values in each column.

#NAs in each row
rbind(colSums(is.na(df1)))

##      carat cut color clarity depth table price x y z
## [1,]     0   0     0       0     0     0     0 0 0 0

b) Graph of Missing values

Below, vis_miss() of visdata package has been used to build a graph of missing values. Since, no missing values present, hence the entire graph is grey, if missing values were present, we would have seen some horizontal lines inside each column.

vis_miss() is ggplot-based, so ggplot2 functions can be added as layers to improve aesthetics.

visdat::vis_miss(df1)+
  theme(axis.title.y = element_text(size = 10),
        axis.text = element_text(color = "black", size = 8))

2. Duplicate Rows- found.

#total number of duplicate rows
sum(duplicated(df1))

## [1] 146

#view all duplicate rows
df1[duplicated(df1),]

## # A tibble: 146 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.79 Ideal     G     SI1      62.3    57  2898  5.9   5.85  3.66
##  2  0.79 Ideal     G     SI1      62.3    57  2898  5.9   5.85  3.66
##  3  0.79 Ideal     G     SI1      62.3    57  2898  5.9   5.85  3.66
##  4  0.79 Ideal     G     SI1      62.3    57  2898  5.9   5.85  3.66
##  5  1.52 Good      E     I1       57.3    58  3105  7.53  7.42  4.28
##  6  1    Fair      E     SI2      67      53  3136  6.19  6.13  4.13
##  7  1    Fair      F     SI2      65.1    55  3265  6.26  6.23  4.07
##  8  0.9  Very Good I     VS2      58.4    62  3334  6.29  6.35  3.69
##  9  1    Ideal     E     SI2      62.9    56  3450  6.32  6.3   3.97
## 10  1    Fair      H     SI1      65.5    57  3511  6.26  6.21  4.08
## # ℹ 136 more rows

It is highly unlikely for diamonds to have exactly all 10 features with same values. While price, cut, clarity etc. can be same, even length, width and depth, table have exactly same measurements. So, I’m removing these duplicate rows.

#remove duplicate rows
df1_unique <- df1 %>% distinct()

3. Rename columns

#renaming some columns to avoid confusion
df1_unique <- df1_unique %>% 
                rename(length = x,
                       width = y,
                       depth = z,
                       depth_perc = depth)

4. Check for unsual values

As mentioned earlier, length, width, depth have a minimum value of 0.000mm which seems unlikely for any diamond.

summary(df1_unique[,c('length', 'width', 'depth')])

##      length           width            depth       
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.700   Median : 5.710   Median : 3.530  
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.030  
##  Max.   :10.740   Max.   :58.900   Max.   :31.800

Depth percentage is calculated as : $2*z/(x+y)$.
If depth is 0, then depth% should also be 0.

a) Length column

df1_unique %>% subset(length < 0.001, select = c(carat, length, width,depth, depth_perc))

## # A tibble: 7 × 5
##   carat length width depth depth_perc
##   <dbl>  <dbl> <dbl> <dbl>      <dbl>
## 1  1.07      0  6.62     0       61.6
## 2  1         0  0        0       63.3
## 3  1.14      0  0        0       57.5
## 4  1.56      0  0        0       62.2
## 5  1.2       0  0        0       62.1
## 6  2.25      0  0        0       62.8
## 7  0.71      0  0        0       64.1

All 7 rows have depth 0 but some value in $depth\%$ column, which is just not possible!

b) Width column

#checking rows with width 0
df1_unique %>% 
  subset(width < 0.001,
         select = c(carat, length, width, depth,depth_perc))

## # A tibble: 6 × 5
##   carat length width depth depth_perc
##   <dbl>  <dbl> <dbl> <dbl>      <dbl>
## 1  1         0     0     0       63.3
## 2  1.14      0     0     0       57.5
## 3  1.56      0     0     0       62.2
## 4  1.2       0     0     0       62.1
## 5  2.25      0     0     0       62.8
## 6  0.71      0     0     0       64.1

All these rows with 0 length and width also have depth=0.

c) Depth column

#see all rows with depth 0
df1_unique %>% 
  subset(depth < 0.001,
         select = c(carat, length, width, depth, depth_perc))

## # A tibble: 19 × 5
##    carat length width depth depth_perc
##    <dbl>  <dbl> <dbl> <dbl>      <dbl>
##  1  1      6.55  6.48     0       59.1
##  2  1.01   6.66  6.6      0       58.1
##  3  1.1    6.5   6.47     0       63  
##  4  1.01   6.5   6.47     0       59.2
##  5  1.5    7.15  7.04     0       64  
##  6  1.07   0     6.62     0       61.6
##  7  1      0     0        0       63.3
##  8  1.15   6.88  6.83     0       59.2
##  9  1.14   0     0        0       57.5
## 10  2.18   8.49  8.45     0       59.4
## 11  1.56   0     0        0       62.2
## 12  2.25   8.52  8.42     0       61.3
## 13  1.2    0     0        0       62.1
## 14  2.2    8.42  8.37     0       61.2
## 15  2.25   0     0        0       62.8
## 16  2.02   8.02  7.95     0       62.7
## 17  2.8    8.9   8.85     0       63.8
## 18  0.71   0     0        0       64.1
## 19  1.12   6.71  6.67     0       60.4

Since there are 19 rows with depth=0.000, I’ll start by first removing these rows and see how many length=0 or width=0 get eliminated.

#remove rows with width=0.000
df1_unique <- df1_unique %>% 
                filter(depth > 0.001)

d) Checking for values in length & width again:

#checking number of rows still left with length=0.000 or width=0.000
df1_unique %>% 
  subset(width < 0.001 | length < 0.001,
         select = c(carat, length, width, depth, depth_perc))

## # A tibble: 0 × 5
## # ℹ 5 variables: carat <dbl>, length <dbl>, width <dbl>, depth <dbl>,
## #   depth_perc <dbl>

Removing rows with depth=0.000 also removes rows with length=0.000, width=0.000.

II. All Data-Distribution Graphs

Following charts have been created to show the range of options available to visualize data.
The next section “Detailed Analysis” contains an in-depth analysis of the dataset.

theme_set(
  theme_bw()+
  theme(axis.text = element_text(color = "black", size = 9, face = "bold"),
        axis.title = element_text(size = 8, face = "bold"),
        plot.title = element_text(size = 9, hjust = 0.5),
        plot.subtitle = element_text(size = 8, hjust = 0.5)
        )
)

1. Categorical Features

Barcharts- good to see total number of data points in each category.
Heatmaps- good to see total count of each sub-category for the main category. In the example below, we can visualize number of diamonds of each color for each type of cut.
Treemaps- can be used to visualize multiple categories within a category. In the e.g. below, the main category is cut, sub-category1 is color and sub-category2 is clarity of diamond. The area of tiles reflects the proportion of data points within each category.

a) Barcharts

#total count of each diamond-type in dataset
df1_unique %>%
    ggplot(aes(x=cut))+
    geom_bar(color = "steelblue", fill = "lightblue", width = 0.5)+
    geom_text(aes(label = after_stat(count)), stat = "count", size = 2,
              fontface = "bold", vjust = -0.5)+
    labs(title = "Ideal diamonds are highest in number",
         x = "type of cut", y = "total count in the dataset")

b) HeatMaps

Such maps can be used to view the ‘group’ with high count/value. For e.g.the map below shows the relationship between diamond cut and color.
The color intensity in the tiles reflect number of datapoints with the corresponding cut and color.
Darker tiles reflect high count.

#group data by cut and color
cut_color_aggr <- df1_unique %>% 
                    group_by(cut, color) %>% 
                    summarise(total_count = n())

#heatmap using 2 ordinal variables- color, cut
cut_color_aggr %>% 
  ggplot(aes(x = cut, y = color, fill = total_count))+
  geom_tile(color = "black")+
  scale_fill_gradient(low = "#FFE900", high = "#E43D00")+
  #remove space between X-Y axis and main plot
  scale_y_discrete(expand = c(0,0))+
  scale_x_discrete(expand = c(0,0))+
  labs(title = "Frequency of diamonds grouped by cut & color ratings",
       subtitle = "Dark color reflects higher count")

c) Treemaps

#data prep- aggregate of the 3 Cs- cut, clarity and color
aggr_3c <- df1_unique %>% 
            group_by(cut, color, clarity) %>% 
            summarise(total_count = n())

#treemap of cut, color,  clarity
treemap(aggr_3c,
  index = c('cut', 'color', 'clarity'),
  vSize = 'total_count',
  type = "index",
  title = "Number of diamonds by cut, color & clarity",
  #increse label size if rectangles are bigger
  inflate.labels = F,
  bg.labels = "yellow",
  #size of labels of each group, sub-group, sub-sub-group
  fontsize.labels = c(10,8,6),
  #border color for group, sub-group, sub-sub-group
  border.col = c("black","black","white")
)

2. Numerical Features

a) Histograms

#simple histogram of carat distribution
carat_hist <- df1_unique %>% 
                ggplot(aes(x=carat))+
                geom_histogram(binwidth = 0.3, fill = "pink",
                               color = "#d45087",
                               #boundary sets the X-labels as start-end values of an interval; and not in center of bar 
                               boundary = 0,
                               closed = "left")+
                scale_y_continuous(expand = expansion(mult = c(0,0.02)))+
                scale_x_continuous(breaks = seq(0,5.30,0.30),
                                   labels = seq(0,5.3,0.3))+
                labs(title = "Simple histogram",
                     subtitle = "total diamond on Y-axis")+
                theme(axis.text = element_text(size = 7))

#Relative frequency histogram with % of data-points on Y-axis.
carat_relative_hist <- df1_unique %>% 
                          ggplot(aes(x=carat))+
                          geom_histogram(aes(y = after_stat(count)/sum(count)),
                                         binwidth = 0.3,
                                         fill = "pink",
                                         color = "#d45087",
                                         boundary = 0,
                                         closed = "left")+
                          scale_x_continuous(breaks = seq(0,5.3,.3),
                                             labels = seq(0,5.3,.3))+
                          scale_y_continuous(expand = expansion(mult = c(0,0.02)))+
                          labs(title = "Relative frequency histogram",
                               subtitle = "relative % on Y axis",
                               y = "freq percentage")+
                          theme(axis.text = element_text(size = 7))

#histogram plots stitched
(carat_hist + carat_relative_hist)+
  plot_annotation(title = "Carat distribution: Two types of histograms",
                  theme = theme(
                    plot.title = element_text(size = 11, face = "bold", hjust = 0.5)))

b) Annotate histogram with mean/median line

In order to add labels like “mean”, “median” along the mean/median lines, first we need to create a text graphical object or ‘Grob’ and then add it to ggplot using annotation_custom().
annotate() can also be used here but annotation_custom() provides more control over the text aesthetics.

#text graphical object for mean/median labels
mean_label <- grid::textGrob(label = "mean",
                            rot = 90, y = 0.80,
                            gp = gpar(fontsize = 7,
                                      fontface = "bold",
                                      col = "darkgreen")
                            )
median_label <- grid::textGrob(label = "median",
                               rot = 90, y = 0.90,
                               gp = gpar(fontsize = 7,
                                         
                                         #fontface = "bold",
                                         col = "blue")
                               )

#mean and median of diamond carats
carat_mean <- mean(df1_unique$carat)
carat_median <- median(df1_unique$carat)

df1_unique %>% 
  ggplot(aes(x=carat))+
  geom_histogram(binwidth = 0.3,
                 fill = "pink",
                 color = "#d45087",
                 boundary = 0,
                 closed = "left")+
  #mean line
  geom_vline(aes(xintercept = mean(carat)), color = "darkgreen", linetype = "dashed")+
  #label mean
  annotation_custom(mean_label, xmin = carat_mean+0.05, xmax = carat_mean+0.05)+
  #median line
  geom_vline(xintercept = carat_median, color = "blue",
             linetype = "dashed", linewidth = 0.5)+
  #label- median
  annotation_custom(median_label, xmin = carat_median-0.05,
                    xmax = carat_median-0.05)+
  scale_x_continuous(breaks = seq(0,5.3,.3),
                     labels = seq(0,5.3,.3))+
  scale_y_continuous(expand = expansion(mult = c(0,0.02)))+
  labs(title = "Carat Distribution")

c) Density Plots

df1_unique %>% 
  ggplot(aes(x=carat, fill = cut))+
  geom_density(outline.type = "full", alpha = 0.9, 
               color = "black")+
  scale_y_continuous(expand = expansion(mult = c(0,0.02)))+
  labs(title = "Density Plot for carats")

d) Frequency Polygon

#frequency polygon for diamond carats- y axis: total count
count_f_poly <- df1_unique %>%
                  ggplot(aes(x=price, color = cut, fill = cut))+
                  geom_freqpoly(binwidth=500)+
                  scale_y_continuous(expand = expansion(mult = c(0,0.02)))+
                  scale_x_continuous(breaks = seq(0,20000,3000),
                                     labels = seq(0,20000,3000))+
                  labs(title = "Frequency Polygon for diamonds price",
                       subtitle = "Y-axis: total count (default)",
                       x = "price (in USD)")+
                  theme(legend.position = "none")

#Y-axis is density, not diamonds count
density_f_poly <- df1_unique %>% 
                  ggplot(aes(x=price, color = cut, fill = cut))+
                  geom_freqpoly(aes(y=after_stat(density)),binwidth=500)+
                  scale_y_continuous(expand = expansion(mult = c(0,0.02)), labels = scales::comma)+
                  scale_x_continuous(breaks = seq(0,20000,3000),
                                     labels = seq(0,20000,3000))+
                  labs(title = "Frequency Polygon for diamonds price",
                       subtitle = "Y-axis: density",
                       x = "price (in USD)")

count_f_poly + density_f_poly

e) Boxplots

df1_unique %>% 
  ggplot(aes(x=cut, y = carat, fill = cut))+
  #add bars at the end of whiskers
  stat_boxplot(geom = "errorbar", width = 0.3)+
  geom_boxplot(outlier.color = "red", outlier.size = 1)+
  labs(title = "Boxplots for carat distribution")

f) Violin Plots

This is a combination of boxplot and density plot.

#violing plot is a combination of box and density plot
df1_unique %>% 
  ggplot(aes(x=cut, y = carat, fill = cut))+
  #trim=F prevents trimming of violin ends, 
  #displays violin for full range of carat
  geom_violin(trim = F, draw_quantiles = c(0.25,0.5,0.75))+
  labs(title = "Violin plot for carat distribution")

g) Rug Plot with Scatter Plot

Rug plots are not separate plots; they just highlight each data point by creating a line for each data point along the axes.

df1_unique %>% 
  filter(cut == "Ideal", color == "D", 
         clarity %in% c("IF","VVS1")) %>% 
  ggplot(aes(x=carat, y=price,color = clarity))+
  geom_point()+
  geom_rug()+
  scale_color_manual(values = c("IF" = "cyan4", "VVS1" = "darkmagenta"))+
  labs(title = "Rug plot + Scatter plot",
       subtitle = "Ideal diamonds with color rating 'D'")

h) Highlight marginal distribution

Marginal plots helps us visualize the distribution intensity at different values of the variable. ggExtra package has been used.

#scatterplot with histogram highlighting marginal distribution
scatter_plot <- df1_unique %>% 
                  filter(cut %in% c("Ideal", "Premium")) %>% 
                  ggplot(aes(x=carat, y = price))+
                  geom_jitter(color = "darkmagenta")+
                  labs(title = "Marginal Distribution through density plot",
                       subtitle = "75% of diamonds have price below $5369, hence the histogram is skewed")

ggExtra::ggMarginal(scatter_plot, type = "histogram",
                    fill = "darkmagenta", margins = 'y')

III. Detailed Analysis

1. Total number of diamonds in each category of cut, color and clarity

##individual plots
#total count of each diamond-type in dataset
cut_count <- df1_unique %>%
    ggplot(aes(x=cut))+
    geom_bar(color = "steelblue", fill = "lightblue", width = 0.5)+
    geom_text(aes(label = paste0(round((after_stat(count)/sum(count))*100,2),"%")), 
              stat = "count", size = 2,
              fontface = "bold", vjust = -0.5)+
    scale_y_continuous(expand = expansion(mult = c(0,0.07)))+
    coord_cartesian(clip = "off")+
    labs(title = "Ideal diamonds are highest in number",
         x = "type of cut", y = "total count in the dataset")

#total count of diamonds of each color type
color_count <- df1_unique %>%
              ggplot(aes(x=color))+
              geom_bar(color = "steelblue", fill = "lightblue", width = 0.5)+
              geom_text(aes(label = paste0(round((after_stat(count)/sum(count))*100,2),"%")), 
                        stat = "count", size = 2,
                        fontface = "bold", vjust = -0.5)+
              scale_y_continuous(expand = expansion(mult = c(0,0.07)))+
              coord_cartesian(clip = "off")+
              labs(title = "Diamonds of color E,F,G form 56.8% of the dataset",
                   x = "color of diamond", y = "")

#total count of diamonds of each clarity type
clarity_count <- df1_unique %>%
              ggplot(aes(x=clarity))+
              geom_bar(color = "steelblue", fill = "lightblue", width = 0.5)+
              geom_text(aes(label = paste0(round((after_stat(count)/sum(count))*100,2),"%")), 
                        stat = "count", size = 2,
                        fontface = "bold", vjust = -0.5)+
              scale_y_continuous(expand = expansion(mult = c(0,0.07)))+
              coord_cartesian(clip = "off")+
              labs(title = "Diamond clarity",
                   x = "diamond clarity", y = "")

#stitch all bar charts
cut_count + color_count + clarity_count

Observations:

The dataset contains super-quality diamonds in terms of their cut; ‘Premium’ & ‘Ideal’ diamonds are most abundant.
Diamonds with color rating ‘G’ are highest in number followed by ‘E’. D-F color rating diamonds are colorless, hence of better standards. G-J rated diamonds have a yellowish tint as we progress from G < H < I < J.
‘G’ diamonds lie on the margin of these 2 groups and they are most abundant in the dataset.
“IF”, ‘Internally Flawless’ diamonds are the best in terms of clarity and they have the second lowest count in our dataset.

The 4Cs- carat, cut, color, clarity are considered the benchmarks for assessing diamonds and in upcoming section, we’ll see how each of these factors affect the diamond price individually and together.

2. Frequency of diamonds based on 2 features

cut_color_heatmap <- df1_unique %>% 
                      group_by(cut, color) %>% 
                      summarize(total_count = n()) %>% 
                      ggplot(aes(x=cut, y = color, fill = total_count))+
                      geom_tile(color = "black")+
                      scale_x_discrete(expand = c(0,0))+
                      scale_y_discrete(expand = c(0,0))+
                      scale_fill_gradient(low = "yellow",
                                          high = "firebrick",
                                          guide = guide_colorbar(title = "diamonds count", title.theme = element_text(size = 7, face = "bold", vjust = 2)))+
                      labs(title = "Ideal diamonds with color rating 'G' are most abundant")

#cut and clarity
cut_clarity_heatmap <- df1_unique %>% 
                      group_by(cut, clarity) %>% 
                      summarize(total_count = n()) %>% 
                      ggplot(aes(x=cut, y = clarity, fill = total_count))+
                      geom_tile(color = "black")+
                      scale_x_discrete(expand = c(0,0))+
                      scale_y_discrete(expand = c(0,0))+
                      scale_fill_gradient(
                                          low = "yellow",
                                          high = "firebrick",
                                          guide = guide_colorbar(title = "diamonds count", title.theme = element_text(size = 7, face = "bold", vjust = 2)))+
                      labs(title = "Ideal diamonds with VS2 clarity are most abundant")

cut_color_heatmap + cut_clarity_heatmap

df1_unique %>% 
  group_by(color, clarity) %>% 
  summarize(count = n()) %>% 
  ggplot(aes(x=color, y=clarity, fill=count))+
  geom_tile(color = "black")+
  scale_x_discrete(expand = c(0,0))+
  scale_y_discrete(expand = c(0,0))+
  scale_fill_gradient(low = "yellow", high = "firebrick",
                      guide = guide_colorbar(title = "diamonds count", title.theme = element_text(size = 8, face = "bold")))+
  labs(title = "Majority of diamonds with color rating 'E' have clarity SI1 or VS2")

3. Which type of diamonds are most abundant considering cut, color and clarity?

#grouped by cut, clarity and color
df1_unique %>% 
  group_by(cut, color, clarity) %>% 
  summarize(total_count = n()) %>% 
  ggplot(aes(x=cut, y= total_count, fill = cut))+
  geom_bar(stat = "identity")+
  scale_y_continuous(expand = expansion(mult = c(0,0.05)))+
  facet_grid(clarity~color)+
  theme(
    strip.text.x = element_text(size = 5, margin = margin(t=0.4,b=0.4), face = "bold"),
    strip.text.y = element_text(size = 5,face = "bold", margin = margin(r=0.4,l=0.4)),
    axis.text = element_text(size = 6),
    axis.text.x = element_text(angle = 90)
  )+
  labs(title = "Diamond frequency based on 3Cs",
       y="")

Observations

Ideal diamonds- ‘E’ color and ‘VS2’ clarity are the most abundant.
Majority of diamods lie in the clarity rating of ‘VS2’,‘SI1’ & ‘SI2’.

4. Trends in Price & Carat with the 3Cs

a) Cut

cut_price <- df1_unique %>% 
              ggplot(aes(x=cut, fill = cut, y = price))+
              stat_boxplot(geom = "errorbar", width = 0.3)+
              geom_boxplot(color = "black", outlier.colour = "red",
                           outlier.size = 0.3)+
              labs(title = "Price distribution of diamonds based on cut")+
  theme(legend.position = "none")
#carat-cut boxplot
cut_carat <- df1_unique %>% 
  ggplot(aes(x=cut, y = carat, fill = cut))+
  stat_boxplot(geom = "errorbar", width = 0.3)+
  geom_boxplot(color = "black", outlier.colour = "red", outlier.size = 0.6)+
  labs(title = "Carat distribution with respect to cut")+
  theme(legend.position = "none")

cut_price + cut_carat

Observation:

Ideal cut diamonds are the best and ‘Fair’ cut is the worst, among all; yet the the median price of Ideal diamonds is the lowest!

#checking the quantile values for Ideal diamonds
print("Carat percentile of Ideal diamonds")

## [1] "Carat percentile of Ideal diamonds"

quantile(df1_unique$carat[df1_unique$cut == "Ideal"])

##   0%  25%  50%  75% 100% 
## 0.20 0.35 0.54 1.01 3.50

Although Ideal diamonds account for 39% of the dataset, 75% of them are below 1 carat & 50% of all ideal diamonds are not even half a carat, which to some extent explains why median price of ideal diamonds is lower.

b) Color

#boxplot of color and price of diamond
color_price <- df1_unique %>% 
  ggplot(aes(x=color, y = price, fill = color))+
  stat_boxplot(geom = "errorbar", width = 0.3)+
  geom_boxplot(color = "black", outlier.colour = "red",
               outlier.size = 0.4)+
  labs(title = "Price distribution with respect to diamond color")+
  theme(legend.position = "none")

#boxplot of color and carat
color_carat <- df1_unique %>% 
  ggplot(aes(x=color, y = carat, fill = color))+
  stat_boxplot(geom = "errorbar", width = 0.3)+
  geom_boxplot(color = "black", outlier.colour = "red",
               outlier.size = 0.4)+
  labs(title = "Carat distribution with respect to diamond color")+
  theme(legend.position = "none") 

#stitch both graphs
color_price + color_carat

Observations:

D-F rated diamonds are colorless and of highest quality, yet as we move from D (best) to J (worst), the median diamond price increases!

print("Carat percentile of D-rated diamonds")

## [1] "Carat percentile of D-rated diamonds"

quantile(df1_unique$carat[df1_unique$color == "D"],
         probs = c(0,25,50,75,90,99)/100)

##   0%  25%  50%  75%  90%  99% 
## 0.20 0.36 0.53 0.91 1.10 1.75

print("Carat percentile of J-rated diamonds")

## [1] "Carat percentile of J-rated diamonds"

quantile(df1_unique$carat[df1_unique$color == "J"],
         probs = c(0,25,50,75,90,99)/100)

##   0%  25%  50%  75%  90%  99% 
## 0.23 0.71 1.11 1.52 2.03 2.51

99% D-rated diamonds are below 1.71 carats.
50% of D-rated diamonds are below 0.5 carats whereas median carat for J-rated diamond is 1.11.

We can observe from these readings that carat of a diamond seems to play a crucial role in price determination even though color, cut is of highest quality.

c) Clarity

#boxplot for price of diamone and their clarity
clarity_price <- df1_unique %>% 
                  ggplot(aes(x=clarity, y = price, fill = clarity))+
                  stat_boxplot(geom = "errorbar", width = 0.3)+
                  geom_boxplot(color = "black", outlier.colour = "red", outlier.size = 0.4)+
                  labs(title = "price distribution with respect to clarity")+
  theme(legend.position = "none")

#boxplot for clarity and carat
clarity_carat <- df1_unique %>% 
                  ggplot(aes(x=clarity, y = carat, fill = clarity))+
                  stat_boxplot(geom = "errorbar", width = 0.3)+
                  geom_boxplot(color = "black", outlier.colour = "red", outlier.size = 0.4)+
                  labs(title = "carat distribution with respect to clarity")+
  theme(legend.position = "none")

#stitch graphs
clarity_price + clarity_carat

Observation

Even though ‘IF’ and ‘VVS1’ are diamonds with maximum brilliance, their median price is less than diamonds with low clarity rating!

So, to summarize:

As the quality of diamonds increased within each rating category- cut, clarity, color, the median price decreased.
This was somewhat unexpected but it also indicates that the price of a diamond cannot be estimated correctly if we know only one or two of these ratings.
The weight of the diamond, carats, seems to play a more important role in price determination (ofc, along with these 3 qualities).

5. Relationship between price and carat of diamonds

(cut_scatter <- df1_unique %>% 
  ggplot(aes(x=carat, y=price, color = cut))+
  geom_point()+
  #scale_y_continuous(expand = expansion(mult = c(0,0.05)))+
  labs(title = "As carat increases, prices tend to go up",
       y = "price in USD"))

6. Correlation between features

The correlation matrix helps to quantify the dependency between all relevant features.
Since, depth_perc is calculated using length, width & depth values, these 3 features have been excluded.

#select relevant features
corr_dataset <- df1_unique %>% 
                  select(cut, color, clarity,depth_perc,carat, price)

There are 3 columns of type ‘factor’ in our data- color, cut & clarity.
Color and Cut have levels in ascending order of the quality. They are ordered from worst to best. For e.g. Quality of diamond cut- “Fair” has level 1 < “Good” is level 2, and so on. Similarly, for clarity- “I1” is 1, …,“IF” is 8, which is the best.
This is not the case with colors column. It is ordered as “D” (1) < “E” (2) < … < “J” (7). D is best rating but it has level of 1.
I’ve changed the order from worst to best for color column below, i.e. “J” gets level 1, “D” gets level 7.

corr_dataset$color <- factor(corr_dataset$color,
                             levels = c("J", "I", "H", "G", "F", "E", "D"), ordered = T)
##checking the order
#levels(corr_dataset$color)[7]

# Convert the factor to numeric
#correlation matrix can be created with numeric values only
#this replaces values like "ideal" with their ordering number.
corr_dataset$color <- as.numeric(corr_dataset$color)
corr_dataset$clarity <- as.numeric(corr_dataset$clarity)
corr_dataset$cut <- as.numeric(corr_dataset$cut)

cor(corr_dataset)

##                    cut       color     clarity  depth_perc       carat
## cut         1.00000000  0.02014858  0.18833670 -0.21713549 -0.13335113
## color       0.02014858  1.00000000 -0.02570228 -0.04765400 -0.29094288
## clarity     0.18833670 -0.02570228  1.00000000 -0.06740010 -0.35219384
## depth_perc -0.21713549 -0.04765400 -0.06740010  1.00000000  0.02788878
## carat      -0.13335113 -0.29094288 -0.35219384  0.02788878  1.00000000
## price      -0.05222510 -0.17174629 -0.14612430 -0.01114383  0.92154847
##                  price
## cut        -0.05222510
## color      -0.17174629
## clarity    -0.14612430
## depth_perc -0.01114383
## carat       0.92154847
## price       1.00000000

#correlation graph
visdat::vis_cor(corr_dataset)

Conclusion

Correlation matrix shows that carat has a very strong positive relationship with price of diamonds.
Other qualitative factors like color, cut, clarity have a weak relationship with price for this dataset.
Carat seems to be the driving factor for price of diamonds.

Diamonds data - Descriptive Statistics

Shefali C.

2024-02-20

Data description

Initial Observations from summary stats:

I. Data Cleaning

1. Missing Values- none

a) Usual method of checking missing values

b) Graph of Missing values

2. Duplicate Rows- found.

3. Rename columns

4. Check for unsual values

a) Length column

b) Width column

c) Depth column

d) Checking for values in length & width again:

II. All Data-Distribution Graphs

1. Categorical Features

a) Barcharts

b) HeatMaps

c) Treemaps

2. Numerical Features

a) Histograms

b) Annotate histogram with mean/median line

c) Density Plots

d) Frequency Polygon

e) Boxplots

f) Violin Plots

g) Rug Plot with Scatter Plot

h) Highlight marginal distribution

III. Detailed Analysis

1. Total number of diamonds in each category of cut, color and clarity

2. Frequency of diamonds based on 2 features

3. Which type of diamonds are most abundant considering cut, color and clarity?

4. Trends in Price & Carat with the 3Cs

a) Cut

b) Color

c) Clarity

5. Relationship between price and carat of diamonds

6. Correlation between features

Conclusion