Exploratory Data Analysis on the Diamonds dataset

The diamonds dataset contains features and pricing on a large set of diamonds. This post shows how to use subject matter knowledge to impute missing values on certain features in this dataset.

Getting the dataset

The diamonds dataset is included in the ggplot package, which is part of the tidyverse set of packages - more information is available at https://www.tidyverse.org/

We begin by installing the tidyverse package

install.packages('tidyverse')

and then using the library command to bring packages into the workspace

library(tidyverse)
## -- Attaching packages -------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0       v purrr   0.3.0  
## v tibble  2.0.1       v dplyr   0.8.0.1
## v tidyr   0.8.2       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0
## -- Conflicts ----------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Initial study of dataset

Let’s have a look at the diamonds dataset structure using str

str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

the str output identifies the fields in the database. Another useful command is glimpse - in the dplyr package

glimpse(diamonds)
## Observations: 53,940
## Variables: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...

The glimpse output is better formatted because the factor fields have ‘human readable’ values. Let’s look at the summary of each feature using the summary command

summary(diamonds)
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 

As you can see, the x,y and z fields show the dimensions of the diamonds in the x direction(length), y direction(width) and z direction(depth). This also means that we are dealing with round diamonds, so in essence the x and y are simply different measurements of the diameters of the stones.

If you notice, the 1st quartile, median, mean and 3rd quartile of the x dimension are similar to those of the y dimension. However, the y max shows a number as high as 58.9. The highest carat in the database is 5.01. Using subject matter knowledge, the maximum dimension of a 5 carat round stone cannot be 58.9 - so this ymax value is an obvious typo. Similarly, z cannot range to 31.8 - this would imply a diamond as tall as 1.25 inches!

Outlier correction

diamonds_corrected <- diamonds

Let’s check out how many outliers need to be corrected

diamonds_corrected %>% filter(y>11 | z>11)
## # A tibble: 3 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  2    Premium   H     SI2      58.9  57   12210  8.09 58.9   8.06
## 2  0.51 Very Good E     VS1      61.8  54.7  1970  5.12  5.15 31.8 
## 3  0.51 Ideal     E     VS1      61.8  55    2075  5.15 31.8   5.12

Since there are only 3 entries, we can manually correct them by setting the y values equal to the corresponding x values. For the z value - using subject matter knowledge, we set the depth to 60% of the length

diamonds_corrected <- diamonds_corrected %>% mutate(y=case_when(y>50 ~ 8.09, y>30 ~ 5.15, TRUE ~ y),z=case_when(z>30 ~ 0.6*5.15, TRUE ~ z))

Let’s check out the outliers again

diamonds_corrected %>% filter(y>11 | z>11)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>,
## #   clarity <ord>, depth <dbl>, table <dbl>, price <int>, x <dbl>,
## #   y <dbl>, z <dbl>

We get nothing, so the above code worked fine.

The summary also showed zero values for x,y,z which are impossible, because even a diamond with only 0.01 carat of weight has a diameter of approximately 1.3mm - hence the zeroes represent missing values

diamonds_corrected %>% filter(x==0 |y==0 | z==0)
## # A tibble: 20 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  1    Premium   G     SI2      59.1    59  3142  6.55  6.48     0
##  2  1.01 Premium   H     I1       58.1    59  3167  6.66  6.6      0
##  3  1.1  Premium   G     SI2      63      59  3696  6.5   6.47     0
##  4  1.01 Premium   F     SI2      59.2    58  3837  6.5   6.47     0
##  5  1.5  Good      G     I1       64      61  4731  7.15  7.04     0
##  6  1.07 Ideal     F     SI2      61.6    56  4954  0     6.62     0
##  7  1    Very Good H     VS2      63.3    53  5139  0     0        0
##  8  1.15 Ideal     G     VS2      59.2    56  5564  6.88  6.83     0
##  9  1.14 Fair      G     VS1      57.5    67  6381  0     0        0
## 10  2.18 Premium   H     SI2      59.4    61 12631  8.49  8.45     0
## 11  1.56 Ideal     G     VS2      62.2    54 12800  0     0        0
## 12  2.25 Premium   I     SI1      61.3    58 15397  8.52  8.42     0
## 13  1.2  Premium   D     VVS1     62.1    59 15686  0     0        0
## 14  2.2  Premium   H     SI1      61.2    59 17265  8.42  8.37     0
## 15  2.25 Premium   H     SI2      62.8    59 18034  0     0        0
## 16  2.02 Premium   H     VS2      62.7    53 18207  8.02  7.95     0
## 17  2.8  Good      G     SI2      63.8    58 18788  8.9   8.85     0
## 18  0.71 Good      F     SI2      64.1    60  2130  0     0        0
## 19  0.71 Good      F     SI2      64.1    60  2130  0     0        0
## 20  1.12 Premium   G     I1       60.4    59  2383  6.71  6.67     0

The output shows 20 data points with zero values in x,y or z - these are missing values and need to be imputed from the remaining data and filled in. Subject matter knowledge We need to be careful here to carry out this correction; the missing values need to be filled in by imputation using grouped means. A straight average of the x values or y values across the entire dataset will be incorrect. This is because using subject matter knowledge, the x and y values (diameters) are dependent on both the carat weight as well as the cut of the diamond.

diamonds_corrected <- diamonds_corrected %>% group_by(carat,cut) %>% 
  mutate(x=case_when(x==0 ~ mean(x), TRUE ~ x), y=case_when(y==0 ~ mean(y), TRUE ~ y))

After the above correction, when we check the zero values, we see something strange.

diamonds_corrected %>% filter(x==0 |y==0 | z==0)
## # A tibble: 20 x 10
## # Groups:   carat, cut [17]
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  1    Premium   G     SI2      59.1    59  3142  6.55  6.48     0
##  2  1.01 Premium   H     I1       58.1    59  3167  6.66  6.6      0
##  3  1.1  Premium   G     SI2      63      59  3696  6.5   6.47     0
##  4  1.01 Premium   F     SI2      59.2    58  3837  6.5   6.47     0
##  5  1.5  Good      G     I1       64      61  4731  7.15  7.04     0
##  6  1.07 Ideal     F     SI2      61.6    56  4954  6.53  6.62     0
##  7  1    Very Good H     VS2      63.3    53  5139  6.35  6.38     0
##  8  1.15 Ideal     G     VS2      59.2    56  5564  6.88  6.83     0
##  9  1.14 Fair      G     VS1      57.5    67  6381  3.28  3.27     0
## 10  2.18 Premium   H     SI2      59.4    61 12631  8.49  8.45     0
## 11  1.56 Ideal     G     VS2      62.2    54 12800  7.28  7.29     0
## 12  2.25 Premium   I     SI1      61.3    58 15397  8.52  8.42     0
## 13  1.2  Premium   D     VVS1     62.1    59 15686  6.80  6.74     0
## 14  2.2  Premium   H     SI1      61.2    59 17265  8.42  8.37     0
## 15  2.25 Premium   H     SI2      62.8    59 18034  7.28  7.22     0
## 16  2.02 Premium   H     VS2      62.7    53 18207  8.02  7.95     0
## 17  2.8  Good      G     SI2      63.8    58 18788  8.9   8.85     0
## 18  0.71 Good      F     SI2      64.1    60  2130  5.59  5.60     0
## 19  0.71 Good      F     SI2      64.1    60  2130  5.59  5.60     0
## 20  1.12 Premium   G     I1       60.4    59  2383  6.71  6.67     0

The diamond with a carat weight of 1.14 and Fair cut has an x value of 3.28, while the diamond with a carat weight of marginal difference 1.15 has an x value of 6.88. Using subject matter knowledge, the usual diameter of a 1 carat stone is 6.2, so something has gone wrong with the stone of 1.14.

Let’s investigate the stones with weight 1.14 and Fair cut

diamonds_corrected %>% filter(carat==1.14, cut=="Fair")
## # A tibble: 2 x 10
## # Groups:   carat, cut [1]
##   carat cut   color clarity depth table price     x     y     z
##   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1.14 Fair  J     SI2      64.4    55  3403  6.56  6.54  4.22
## 2  1.14 Fair  G     VS1      57.5    67  6381  3.28  3.27  0

Ah, we see that the mean has taken the incorrect zero value and included it in the average. We need to remove the zero values when we take the mean.

Restarting the data set

diamonds_corrected <- diamonds
diamonds_corrected <- diamonds_corrected %>% mutate(y=case_when(y>50 ~ 8.09, y>30 ~ 5.15, TRUE ~ y),z=case_when(z>30 ~ 0.6*5.15, TRUE ~ z))
diamonds_corrected <- diamonds_corrected %>% group_by(carat,cut) %>% 
  mutate(x=case_when(x==0 ~ mean(x[x!=0]), TRUE ~ x), y=case_when(y==0 ~ mean(y[y!=0]), TRUE ~ y))

If we run the check again

diamonds_corrected %>% filter(x==0 |y==0 | z==0)
## # A tibble: 20 x 10
## # Groups:   carat, cut [17]
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  1    Premium   G     SI2      59.1    59  3142  6.55  6.48     0
##  2  1.01 Premium   H     I1       58.1    59  3167  6.66  6.6      0
##  3  1.1  Premium   G     SI2      63      59  3696  6.5   6.47     0
##  4  1.01 Premium   F     SI2      59.2    58  3837  6.5   6.47     0
##  5  1.5  Good      G     I1       64      61  4731  7.15  7.04     0
##  6  1.07 Ideal     F     SI2      61.6    56  4954  6.57  6.62     0
##  7  1    Very Good H     VS2      63.3    53  5139  6.37  6.40     0
##  8  1.15 Ideal     G     VS2      59.2    56  5564  6.88  6.83     0
##  9  1.14 Fair      G     VS1      57.5    67  6381  6.56  6.54     0
## 10  2.18 Premium   H     SI2      59.4    61 12631  8.49  8.45     0
## 11  1.56 Ideal     G     VS2      62.2    54 12800  7.45  7.46     0
## 12  2.25 Premium   I     SI1      61.3    58 15397  8.52  8.42     0
## 13  1.2  Premium   D     VVS1     62.1    59 15686  6.83  6.78     0
## 14  2.2  Premium   H     SI1      61.2    59 17265  8.42  8.37     0
## 15  2.25 Premium   H     SI2      62.8    59 18034  8.49  8.43     0
## 16  2.02 Premium   H     VS2      62.7    53 18207  8.02  7.95     0
## 17  2.8  Good      G     SI2      63.8    58 18788  8.9   8.85     0
## 18  0.71 Good      F     SI2      64.1    60  2130  5.68  5.69     0
## 19  0.71 Good      F     SI2      64.1    60  2130  5.68  5.69     0
## 20  1.12 Premium   G     I1       60.4    59  2383  6.71  6.67     0

The problem has gone away. Now we turn our attention to the Z values. Here, we again use some subject matter knowledge to fill in z values as 60% of the diameter. So we use 30% times the (x+y)

diamonds_corrected <- diamonds_corrected %>% mutate(z=case_when(z==0 ~ 0.3*(x+y),TRUE ~ z))

checking the zero filter again -

diamonds_corrected %>% filter(x==0 |y==0 | z==0)
## Warning: Factor `cut` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 0 x 10
## # Groups:   carat, cut [1]
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>,
## #   clarity <ord>, depth <dbl>, table <dbl>, price <int>, x <dbl>,
## #   y <dbl>, z <dbl>

So there are no more zeroes in the x,y and z fields.

Outlier detection can defy standard statistical tests

Investigating the other outliers - let’s look at the table distribution

summary(diamonds_corrected$table)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   43.00   56.00   57.00   57.46   59.00   95.00

The table occupies a range of values - approximately 60% is an ideal diamond, but it can vary from 40 to 70%. 95% is most likely an error. Let’s see how many data points have “outlier-like” large values

diamonds_corrected %>% filter(table>70)
## # A tibble: 8 x 10
## # Groups:   carat, cut [7]
##   carat cut   color clarity depth table price     x     y     z
##   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  2.01 Fair  F     SI1      58.6    95 13387  8.32  8.31  4.87
## 2  0.68 Fair  G     SI1      58      71  1633  5.85  5.7   3.35
## 3  0.7  Fair  H     VS1      62      73  2100  5.65  5.54  3.47
## 4  0.81 Fair  F     SI2      68.8    79  2301  5.26  5.2   3.58
## 5  0.79 Fair  G     SI1      65.3    76  2362  5.52  5.13  3.35
## 6  0.71 Fair  D     VS2      55.6    73  2368  6.01  5.96  3.33
## 7  0.5  Fair  E     VS2      79      73  2579  5.21  5.18  4.09
## 8  0.5  Fair  E     VS2      79      73  2579  5.21  5.18  4.09

Typically Fair cuts will have extreme values for table percentages,i.e., which is clearly seen in the data above. Let’s plot the data using a simple histogram of these values

diamonds_corrected %>% filter(table>70) %>% ggplot(aes(x=table))+geom_bar()

As you can see, the lone point out at 95 is clearly an outlier.

diamonds_corrected <- diamonds_corrected %>% group_by(carat,cut) %>% 
  mutate(table=case_when(table>90 ~mean(table[table<90]), TRUE ~ table))

If we study the stones with carat weight 2.01 and Fair cut

diamonds_corrected %>% filter(carat==2.01,cut=="Fair")
## # A tibble: 31 x 10
## # Groups:   carat, cut [1]
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  2.01 Fair  I     I1       67.4    58  5696  7.71  7.64  5.17
##  2  2.01 Fair  I     I1       55.9    64  5696  8.48  8.39  4.71
##  3  2.01 Fair  G     I1       70.2    57  6315  7.53  7.5   5.27
##  4  2.01 Fair  F     I1       58.7    66  7294  8.3   8.19  4.84
##  5  2.01 Fair  H     SI2      64.9    56 10184  7.88  7.81  5.09
##  6  2.01 Fair  H     SI2      64.9    56 10184  7.88  7.81  5.09
##  7  2.01 Fair  I     SI2      66.5    61 10671  7.81  7.75  5.17
##  8  2.01 Fair  I     SI2      66.9    58 10711  7.93  7.69  5.21
##  9  2.01 Fair  H     SI2      65.3    56 10772  7.95  7.86  5.16
## 10  2.01 Fair  H     SI2      66.7    56 10772  7.8   7.76  5.19
## # ... with 21 more rows

A visual look at the values confirms reasonable values for the table. We can look at a histogram of these values

diamonds_corrected %>% filter(carat==2.01,cut=="Fair") %>% ggplot(aes(x=table))+geom_bar()

The values are closely gathered around 60%, which makes sense.

Another interesting outlier test that does not seem to work here is using standardized z tests on the original diamonds dataset

diamonds %>% group_by(carat,cut) %>% mutate(table_z=(table-mean(table))/sd(table)) %>% 
  summarize(min(table_z), t_max=max(table_z), no_ston=n()) %>% 
  filter(no_ston>1) %>% arrange(desc(t_max))
## # A tibble: 930 x 5
## # Groups:   carat [241]
##    carat cut   `min(table_z)` t_max no_ston
##    <dbl> <ord>          <dbl> <dbl>   <int>
##  1  0.52 Ideal         -2.68   5.76     459
##  2  0.55 Ideal         -2.70   5.75     302
##  3  0.51 Ideal         -2.41   5.15     525
##  4  0.27 Ideal         -2.45   5.13     113
##  5  0.39 Ideal         -2.10   5.03     181
##  6  0.31 Ideal         -2.05   5.00    1209
##  7  0.38 Ideal         -1.95   4.84     378
##  8  0.4  Ideal         -3.00   4.78     545
##  9  2.01 Fair          -0.977  4.62      31
## 10  1.1  Ideal         -2.15   4.62     118
## # ... with 920 more rows

Since many t_max show up as 5 standard deviations away from the mean, we check out a few examples

diamonds %>% filter(carat==0.52,cut=="Ideal") %>% 
  summarize(min(table),mean(table),max(table),sd(table))
## # A tibble: 1 x 4
##   `min(table)` `mean(table)` `max(table)` `sd(table)`
##          <dbl>         <dbl>        <dbl>       <dbl>
## 1           53          55.9           62        1.07

The minimum and maximum are still within reasonable ranges if evaluated per subject matter expertise, but since a large many stones are crowded around the mean of 55.8, a few far off stones with a table of over 60% show up as 5 or more standard deviations away from the mean. Using this common metric of standard deviations will not result in these values of tables being outliers, but the 95% value is definitely an outlier.

Interestingly, if we study the standard deviations of the stone with this outlier value

diamonds %>% filter(carat==2.01,cut=="Fair") %>% summarize(min(table), mean(table),max(table),sd(table),zplus=(max(table)-mean(table))/sd(table))
## # A tibble: 1 x 5
##   `min(table)` `mean(table)` `max(table)` `sd(table)` zplus
##          <dbl>         <dbl>        <dbl>       <dbl> <dbl>
## 1           53          60.3           95        7.50  4.62

The Z value seems high at 4.62, but not nearly as high as 5.76 as in the case of the 0.52 weight, Ideal cut stone.

Conclusion

Outlier detection and correction requires substantial understanding of data and judicious use of subject matter expertise