The diamonds dataset contains physical features and prices for a large set of diamonds. This post shows how to use subject matter knowledge to impute missing values for certain features in this dataset.
The diamonds dataset is included in the ggplot2 package, which is part of the tidyverse set of packages; more information is available at https://www.tidyverse.org/
We begin by installing the tidyverse package
install.packages('tidyverse')
and then using the library command to bring packages into the workspace
library(tidyverse)
## -- Attaching packages -------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.3.0
## v tibble 2.0.1 v dplyr 0.8.0.1
## v tidyr 0.8.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Let’s have a look at the diamonds dataset structure using str
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
The str output identifies the fields in the dataset. Another useful command is glimpse, from the dplyr package
glimpse(diamonds)
## Observations: 53,940
## Variables: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...
The glimpse output is easier to read because the factor fields show 'human readable' values. Let's look at a summary of each feature using the summary command
summary(diamonds)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
As you can see, the x, y and z fields give the dimensions of the diamonds in mm: x is the length, y is the width and z is the depth. Since we are dealing with round diamonds, x and y are in essence two measurements of the same diameter.
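As a quick check of this claim (a minimal sketch), the two diameter measurements should be very highly correlated, barring the data errors we are about to uncover:
# x and y measure the same diameter, so their correlation should be near 1
cor(diamonds$x, diamonds$y)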
Notice that the 1st quartile, median, mean and 3rd quartile of the x dimension are very close to those of the y dimension. However, the maximum of y is 58.9 mm. The heaviest stone in the dataset is 5.01 carats, and subject matter knowledge tells us that a 5 carat round stone cannot be anywhere near 58.9 mm across, so this value is an obvious typo. Similarly, z cannot range up to 31.8 mm; that would imply a diamond about 1.25 inches tall! We start by making a working copy of the dataset:
diamonds_corrected <- diamonds
Let’s check out how many outliers need to be corrected
diamonds_corrected %>% filter(y>11 | z>11)
## # A tibble: 3 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 2 Premium H SI2 58.9 57 12210 8.09 58.9 8.06
## 2 0.51 Very Good E VS1 61.8 54.7 1970 5.12 5.15 31.8
## 3 0.51 Ideal E VS1 61.8 55 2075 5.15 31.8 5.12
Since there are only 3 entries, we can correct them manually by setting the bad y values equal to the corresponding x values. For the bad z value, subject matter knowledge suggests setting the depth to 60% of the diameter:
diamonds_corrected <- diamonds_corrected %>%
  mutate(y = case_when(y > 50 ~ 8.09, y > 30 ~ 5.15, TRUE ~ y),
         z = case_when(z > 30 ~ 0.6 * 5.15, TRUE ~ z))
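Equivalently, instead of hard-coding the replacement values, a more general version of the same fix (a sketch, not the code used above) pulls the replacements from the x column directly:
# Sketch: replace an implausible y with x (the other diameter measurement)
# and an implausible z with 60% of that diameter
diamonds_corrected %>%
  mutate(y = if_else(y > 30, x, y),
         z = if_else(z > 30, 0.6 * x, z))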
Let’s check out the outliers again
diamonds_corrected %>% filter(y>11 | z>11)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>,
## # clarity <ord>, depth <dbl>, table <dbl>, price <int>, x <dbl>,
## # y <dbl>, z <dbl>
We get nothing, so the above code worked fine.
The summary also showed zero values for x, y and z, which are impossible: even a diamond weighing only 0.01 carat has a diameter of approximately 1.3 mm. Hence the zeroes represent missing values.
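A quick way to count the zeroes in each dimension column (a minimal sketch using base R):
# Number of zero entries in each of the dimension columns
colSums(diamonds_corrected[, c("x", "y", "z")] == 0)
Let's look at the affected rows: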
diamonds_corrected %>% filter(x==0 |y==0 | z==0)
## # A tibble: 20 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Premium G SI2 59.1 59 3142 6.55 6.48 0
## 2 1.01 Premium H I1 58.1 59 3167 6.66 6.6 0
## 3 1.1 Premium G SI2 63 59 3696 6.5 6.47 0
## 4 1.01 Premium F SI2 59.2 58 3837 6.5 6.47 0
## 5 1.5 Good G I1 64 61 4731 7.15 7.04 0
## 6 1.07 Ideal F SI2 61.6 56 4954 0 6.62 0
## 7 1 Very Good H VS2 63.3 53 5139 0 0 0
## 8 1.15 Ideal G VS2 59.2 56 5564 6.88 6.83 0
## 9 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 10 2.18 Premium H SI2 59.4 61 12631 8.49 8.45 0
## 11 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 12 2.25 Premium I SI1 61.3 58 15397 8.52 8.42 0
## 13 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 14 2.2 Premium H SI1 61.2 59 17265 8.42 8.37 0
## 15 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 16 2.02 Premium H VS2 62.7 53 18207 8.02 7.95 0
## 17 2.8 Good G SI2 63.8 58 18788 8.9 8.85 0
## 18 0.71 Good F SI2 64.1 60 2130 0 0 0
## 19 0.71 Good F SI2 64.1 60 2130 0 0 0
## 20 1.12 Premium G I1 60.4 59 2383 6.71 6.67 0
The output shows 20 data points with zero values in x, y or z. These are missing values, and they need to be imputed from the remaining data. We need to be careful in carrying out this correction: the missing values should be filled in using grouped means, because a straight average of the x or y values across the entire dataset would be incorrect. Subject matter knowledge tells us that the x and y values (the diameters) depend on both the carat weight and the cut of the diamond.
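To see why grouping matters, note that the typical diameter differs by cut even at a fixed carat weight (a quick sketch; the median is used here to sidestep the zeroes):
# Median x by cut for 1 carat stones - the typical diameter varies with cut
diamonds_corrected %>%
  filter(carat == 1) %>%
  group_by(cut) %>%
  summarize(median_x = median(x), n = n())
With that in mind, we impute the zeroes using means grouped by carat and cut: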
diamonds_corrected <- diamonds_corrected %>% group_by(carat,cut) %>%
mutate(x=case_when(x==0 ~ mean(x), TRUE ~ x), y=case_when(y==0 ~ mean(y), TRUE ~ y))
After the above correction, when we check the zero values, we see something strange.
diamonds_corrected %>% filter(x==0 |y==0 | z==0)
## # A tibble: 20 x 10
## # Groups: carat, cut [17]
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Premium G SI2 59.1 59 3142 6.55 6.48 0
## 2 1.01 Premium H I1 58.1 59 3167 6.66 6.6 0
## 3 1.1 Premium G SI2 63 59 3696 6.5 6.47 0
## 4 1.01 Premium F SI2 59.2 58 3837 6.5 6.47 0
## 5 1.5 Good G I1 64 61 4731 7.15 7.04 0
## 6 1.07 Ideal F SI2 61.6 56 4954 6.53 6.62 0
## 7 1 Very Good H VS2 63.3 53 5139 6.35 6.38 0
## 8 1.15 Ideal G VS2 59.2 56 5564 6.88 6.83 0
## 9 1.14 Fair G VS1 57.5 67 6381 3.28 3.27 0
## 10 2.18 Premium H SI2 59.4 61 12631 8.49 8.45 0
## 11 1.56 Ideal G VS2 62.2 54 12800 7.28 7.29 0
## 12 2.25 Premium I SI1 61.3 58 15397 8.52 8.42 0
## 13 1.2 Premium D VVS1 62.1 59 15686 6.80 6.74 0
## 14 2.2 Premium H SI1 61.2 59 17265 8.42 8.37 0
## 15 2.25 Premium H SI2 62.8 59 18034 7.28 7.22 0
## 16 2.02 Premium H VS2 62.7 53 18207 8.02 7.95 0
## 17 2.8 Good G SI2 63.8 58 18788 8.9 8.85 0
## 18 0.71 Good F SI2 64.1 60 2130 5.59 5.60 0
## 19 0.71 Good F SI2 64.1 60 2130 5.59 5.60 0
## 20 1.12 Premium G I1 60.4 59 2383 6.71 6.67 0
The diamond with a carat weight of 1.14 and a Fair cut now has an x value of 3.28, while the diamond with the marginally different weight of 1.15 has an x value of 6.88. Subject matter knowledge says the usual diameter of a 1 carat stone is around 6.2 mm, so something has gone wrong with the 1.14 carat stone.
Let’s investigate the stones with weight 1.14 and Fair cut
diamonds_corrected %>% filter(carat==1.14, cut=="Fair")
## # A tibble: 2 x 10
## # Groups: carat, cut [1]
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.14 Fair J SI2 64.4 55 3403 6.56 6.54 4.22
## 2 1.14 Fair G VS1 57.5 67 6381 3.28 3.27 0
Ah, we see the problem: the grouped mean took the incorrect zero value and included it in the average. We need to exclude the zero values when we take the mean.
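The effect is easy to see with a toy example (hypothetical numbers):
# A stray zero drags the group mean down; excluding it restores a sensible value
v <- c(6.56, 6.54, 0)
mean(v)          # 4.37, distorted by the zero
mean(v[v != 0])  # 6.55, the value we actually want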
Restarting from a fresh copy of the dataset:
diamonds_corrected <- diamonds
diamonds_corrected <- diamonds_corrected %>%
  mutate(y = case_when(y > 50 ~ 8.09, y > 30 ~ 5.15, TRUE ~ y),
         z = case_when(z > 30 ~ 0.6 * 5.15, TRUE ~ z))
diamonds_corrected <- diamonds_corrected %>% group_by(carat,cut) %>%
mutate(x=case_when(x==0 ~ mean(x[x!=0]), TRUE ~ x), y=case_when(y==0 ~ mean(y[y!=0]), TRUE ~ y))
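One caveat worth noting (a sketch; it is not a problem in this dataset, as the successful imputation below shows): if every stone in some (carat, cut) group had x equal to zero, mean(x[x != 0]) would be the mean of an empty vector and return NaN. A quick check for such groups:
# Groups where no valid (nonzero) x exists would make the grouped mean undefined
diamonds %>%
  group_by(carat, cut) %>%
  summarize(no_valid_x = all(x == 0)) %>%
  filter(no_valid_x)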
If we run the check again
diamonds_corrected %>% filter(x==0 |y==0 | z==0)
## # A tibble: 20 x 10
## # Groups: carat, cut [17]
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Premium G SI2 59.1 59 3142 6.55 6.48 0
## 2 1.01 Premium H I1 58.1 59 3167 6.66 6.6 0
## 3 1.1 Premium G SI2 63 59 3696 6.5 6.47 0
## 4 1.01 Premium F SI2 59.2 58 3837 6.5 6.47 0
## 5 1.5 Good G I1 64 61 4731 7.15 7.04 0
## 6 1.07 Ideal F SI2 61.6 56 4954 6.57 6.62 0
## 7 1 Very Good H VS2 63.3 53 5139 6.37 6.40 0
## 8 1.15 Ideal G VS2 59.2 56 5564 6.88 6.83 0
## 9 1.14 Fair G VS1 57.5 67 6381 6.56 6.54 0
## 10 2.18 Premium H SI2 59.4 61 12631 8.49 8.45 0
## 11 1.56 Ideal G VS2 62.2 54 12800 7.45 7.46 0
## 12 2.25 Premium I SI1 61.3 58 15397 8.52 8.42 0
## 13 1.2 Premium D VVS1 62.1 59 15686 6.83 6.78 0
## 14 2.2 Premium H SI1 61.2 59 17265 8.42 8.37 0
## 15 2.25 Premium H SI2 62.8 59 18034 8.49 8.43 0
## 16 2.02 Premium H VS2 62.7 53 18207 8.02 7.95 0
## 17 2.8 Good G SI2 63.8 58 18788 8.9 8.85 0
## 18 0.71 Good F SI2 64.1 60 2130 5.68 5.69 0
## 19 0.71 Good F SI2 64.1 60 2130 5.68 5.69 0
## 20 1.12 Premium G I1 60.4 59 2383 6.71 6.67 0
The problem has gone away. Now we turn our attention to the z values. Here we again use subject matter knowledge to fill in z as 60% of the diameter. The diameter is the average of x and y, so z = 0.6 * (x + y) / 2, which works out to 30% of (x + y):
diamonds_corrected <- diamonds_corrected %>% mutate(z=case_when(z==0 ~ 0.3*(x+y),TRUE ~ z))
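The flat 60% rule is a reasonable default, but the dataset also records each stone's depth as a total depth percentage, which the ggplot2 documentation defines as 2 * z / (x + y). An alternative to the line above (a sketch) would back z out of each stone's own recorded depth:
# Alternative sketch: recover z from the recorded depth percentage
diamonds_corrected %>%
  mutate(z = if_else(z == 0, depth / 100 * (x + y) / 2, z))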
Checking the zero filter again:
diamonds_corrected %>% filter(x==0 |y==0 | z==0)
## Warning: Factor `cut` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 0 x 10
## # Groups: carat, cut [1]
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>,
## # clarity <ord>, depth <dbl>, table <dbl>, price <int>, x <dbl>,
## # y <dbl>, z <dbl>
So there are no more zeroes in the x, y and z fields.
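As a final sanity check on the dimensions (a sketch; output omitted), we can re-run summary on just these columns and confirm that the minimums are comfortably above zero:
# Summary of the corrected dimension columns
diamonds_corrected %>% ungroup() %>% select(x, y, z) %>% summary()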
Investigating the other outliers, let's look at the distribution of table:
summary(diamonds_corrected$table)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.00 56.00 57.00 57.46 59.00 95.00
The table (the width of the diamond's top facet as a percentage of its widest point) occupies a range of values: approximately 60% is ideal, but it can vary from roughly 40 to 70%. A value of 95% is most likely an error. Let's see how many data points have outlier-like large values:
diamonds_corrected %>% filter(table>70)
## # A tibble: 8 x 10
## # Groups: carat, cut [7]
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 2.01 Fair F SI1 58.6 95 13387 8.32 8.31 4.87
## 2 0.68 Fair G SI1 58 71 1633 5.85 5.7 3.35
## 3 0.7 Fair H VS1 62 73 2100 5.65 5.54 3.47
## 4 0.81 Fair F SI2 68.8 79 2301 5.26 5.2 3.58
## 5 0.79 Fair G SI1 65.3 76 2362 5.52 5.13 3.35
## 6 0.71 Fair D VS2 55.6 73 2368 6.01 5.96 3.33
## 7 0.5 Fair E VS2 79 73 2579 5.21 5.18 4.09
## 8 0.5 Fair E VS2 79 73 2579 5.21 5.18 4.09
Typically, Fair cuts have extreme values for the table percentage, as is clearly seen in the data above. Let's plot these values using a simple histogram:
diamonds_corrected %>% filter(table>70) %>% ggplot(aes(x=table))+geom_bar()
As you can see, the lone point out at 95 is clearly an outlier. We replace it with the grouped mean of the non-extreme values:
diamonds_corrected <- diamonds_corrected %>% group_by(carat,cut) %>%
mutate(table=case_when(table>90 ~mean(table[table<90]), TRUE ~ table))
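To confirm the replacement worked, we can re-run the filter (this should now return zero rows):
diamonds_corrected %>% filter(table > 90)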
If we study the stones with carat weight 2.01 and Fair cut (the group that contained the 95):
diamonds_corrected %>% filter(carat==2.01,cut=="Fair")
## # A tibble: 31 x 10
## # Groups: carat, cut [1]
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 2.01 Fair I I1 67.4 58 5696 7.71 7.64 5.17
## 2 2.01 Fair I I1 55.9 64 5696 8.48 8.39 4.71
## 3 2.01 Fair G I1 70.2 57 6315 7.53 7.5 5.27
## 4 2.01 Fair F I1 58.7 66 7294 8.3 8.19 4.84
## 5 2.01 Fair H SI2 64.9 56 10184 7.88 7.81 5.09
## 6 2.01 Fair H SI2 64.9 56 10184 7.88 7.81 5.09
## 7 2.01 Fair I SI2 66.5 61 10671 7.81 7.75 5.17
## 8 2.01 Fair I SI2 66.9 58 10711 7.93 7.69 5.21
## 9 2.01 Fair H SI2 65.3 56 10772 7.95 7.86 5.16
## 10 2.01 Fair H SI2 66.7 56 10772 7.8 7.76 5.19
## # ... with 21 more rows
A visual look confirms that the table values are now reasonable. We can also look at a histogram of these values:
diamonds_corrected %>% filter(carat==2.01,cut=="Fair") %>% ggplot(aes(x=table))+geom_bar()
The values are closely gathered around 60%, which makes sense.
Another interesting outlier test, which does not work well here, is the standardized z-score, applied within each group of the original diamonds dataset:
diamonds %>% group_by(carat,cut) %>% mutate(table_z=(table-mean(table))/sd(table)) %>%
summarize(min(table_z), t_max=max(table_z), no_ston=n()) %>%
filter(no_ston>1) %>% arrange(desc(t_max))
## # A tibble: 930 x 5
## # Groups: carat [241]
## carat cut `min(table_z)` t_max no_ston
## <dbl> <ord> <dbl> <dbl> <int>
## 1 0.52 Ideal -2.68 5.76 459
## 2 0.55 Ideal -2.70 5.75 302
## 3 0.51 Ideal -2.41 5.15 525
## 4 0.27 Ideal -2.45 5.13 113
## 5 0.39 Ideal -2.10 5.03 181
## 6 0.31 Ideal -2.05 5.00 1209
## 7 0.38 Ideal -1.95 4.84 378
## 8 0.4 Ideal -3.00 4.78 545
## 9 2.01 Fair -0.977 4.62 31
## 10 1.1 Ideal -2.15 4.62 118
## # ... with 920 more rows
Since many t_max values show up as much as 5 standard deviations away from the mean, let's check out one example:
diamonds %>% filter(carat==0.52,cut=="Ideal") %>%
summarize(min(table),mean(table),max(table),sd(table))
## # A tibble: 1 x 4
## `min(table)` `mean(table)` `max(table)` `sd(table)`
## <dbl> <dbl> <dbl> <dbl>
## 1 53 55.9 62 1.07
The minimum and maximum are still within reasonable ranges when evaluated with subject matter expertise, but because the great majority of stones are crowded around the mean of 55.9, the few stones with a table above 60% show up as 5 or more standard deviations from the mean. In other words, the standard deviation metric flags table values that are actually fine, while the one value that is definitely an outlier, the 95%, produces a lower score.
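A more robust alternative (a sketch, not part of the original analysis) is the classic 1.5 * IQR fence, which is less sensitive to tightly clustered groups than the z-score:
# Flag table values outside the 1.5 * IQR fences within each (carat, cut) group
diamonds %>%
  group_by(carat, cut) %>%
  mutate(lower = quantile(table, 0.25) - 1.5 * IQR(table),
         upper = quantile(table, 0.75) + 1.5 * IQR(table)) %>%
  filter(table < lower | table > upper)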
Interestingly, if we study the standard deviations within the group that contains the true outlier (2.01 carat, Fair cut):
diamonds %>% filter(carat == 2.01, cut == "Fair") %>%
  summarize(min(table), mean(table), max(table), sd(table),
            zplus = (max(table) - mean(table)) / sd(table))
## # A tibble: 1 x 5
## `min(table)` `mean(table)` `max(table)` `sd(table)` zplus
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 53 60.3 95 7.50 4.62
The z-score of 4.62 seems high, but it is not nearly as high as the 5.76 seen for the 0.52 carat, Ideal cut stone.
Outlier detection and correction require a substantial understanding of the data and the judicious use of subject matter expertise.