Question

Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

First, calculate summary statistics for these variables and plot their distributions.

library(tidyverse)

summary(select(diamonds, x, y, z))
##        x                y                z         
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.700   Median : 5.710   Median : 3.530  
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :10.740   Max.   :58.900   Max.   :31.800
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01)

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01)

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01)

There are several noticeable features of the distributions:

  1. x and y are larger than z,
  2. there are outliers,
  3. they are all right skewed, and
  4. they are multimodal or “spiky”.

The typical values of x and y are larger than those of z: the middle 50% of x and y values (first to third quartile) lie between roughly 4.7 and 6.5, while the middle 50% of z values lie between roughly 2.9 and 4.0.
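The quartiles quoted above can also be computed directly rather than read off the summary() output. The following is a minimal sketch; the object name quartiles is only illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# First and third quartiles of each dimension.
# The result is a matrix with rows "25%" and "75%" and columns x, y, z.
quartiles <- sapply(diamonds[c("x", "y", "z")], quantile, probs = c(0.25, 0.75))
quartiles
```

The values should agree with the 1st Qu. and 3rd Qu. rows of the summary() output, since summary() relies on the same quantile calculation.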

There are two types of outliers in these data: some diamonds have values of zero, and some have abnormally large values of x, y, or z (review the results of summary() above to see these). These appear to be either data entry errors or an undocumented convention in the dataset for indicating missing values. An alternative hypothesis would be that values of zero are the result of rounding values like 0.002 down, but since no diamond has a value as small as 0.01, that does not seem to be the case.

filter(diamonds, x == 0 | y == 0 | z == 0)
## # A tibble: 20 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  1    Premium   G     SI2      59.1    59  3142  6.55  6.48     0
##  2  1.01 Premium   H     I1       58.1    59  3167  6.66  6.6      0
##  3  1.1  Premium   G     SI2      63      59  3696  6.5   6.47     0
##  4  1.01 Premium   F     SI2      59.2    58  3837  6.5   6.47     0
##  5  1.5  Good      G     I1       64      61  4731  7.15  7.04     0
##  6  1.07 Ideal     F     SI2      61.6    56  4954  0     6.62     0
##  7  1    Very Good H     VS2      63.3    53  5139  0     0        0
##  8  1.15 Ideal     G     VS2      59.2    56  5564  6.88  6.83     0
##  9  1.14 Fair      G     VS1      57.5    67  6381  0     0        0
## 10  2.18 Premium   H     SI2      59.4    61 12631  8.49  8.45     0
## 11  1.56 Ideal     G     VS2      62.2    54 12800  0     0        0
## 12  2.25 Premium   I     SI1      61.3    58 15397  8.52  8.42     0
## 13  1.2  Premium   D     VVS1     62.1    59 15686  0     0        0
## 14  2.2  Premium   H     SI1      61.2    59 17265  8.42  8.37     0
## 15  2.25 Premium   H     SI2      62.8    59 18034  0     0        0
## 16  2.02 Premium   H     VS2      62.7    53 18207  8.02  7.95     0
## 17  2.8  Good      G     SI2      63.8    58 18788  8.9   8.85     0
## 18  0.71 Good      F     SI2      64.1    60  2130  0     0        0
## 19  0.71 Good      F     SI2      64.1    60  2130  0     0        0
## 20  1.12 Premium   G     I1       60.4    59  2383  6.71  6.67     0
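
One way to probe the rounding hypothesis is to count the zeros in each dimension and look at the smallest strictly positive value. This is a sketch rather than part of the original solution; the names dims, n_zero, and min_positive are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

dims <- diamonds[c("x", "y", "z")]

# Number of zero entries in each dimension.
n_zero <- sapply(dims, function(v) sum(v == 0))

# Smallest value that is strictly greater than zero in each dimension.
min_positive <- sapply(dims, function(v) min(v[v > 0]))

data.frame(n_zero, min_positive)
```

If the zeros came from rounding down tiny measurements, values just above zero would be expected as well. A large gap between zero and the smallest positive value points instead to missing or mis-entered data.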

There are also some diamonds with values of y and z that are abnormally large. There are diamonds with y == 58.9 and y == 31.8, and one with z == 31.8. These are probably data errors, since the values are out of line with the other variables: for example, the 0.51 carat diamond with y == 31.8 has x == 5.15 and z == 5.12.

diamonds %>%
  arrange(desc(y)) %>%
  head()
## # A tibble: 6 x 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  2    Premium H     SI2      58.9    57 12210  8.09 58.9   8.06
## 2  0.51 Ideal   E     VS1      61.8    55  2075  5.15 31.8   5.12
## 3  5.01 Fair    J     I1       65.5    59 18018 10.7  10.5   6.98
## 4  4.5  Fair    J     I1       65.8    58 18531 10.2  10.2   6.72
## 5  4.01 Premium I     I1       61      61 15223 10.1  10.1   6.17
## 6  4.01 Premium J     I1       62.5    62 15223 10.0   9.94  6.24
diamonds %>%
  arrange(desc(z)) %>%
  head()
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.51 Very Good E     VS1      61.8  54.7  1970  5.12  5.15 31.8 
## 2  2    Premium   H     SI2      58.9  57   12210  8.09 58.9   8.06
## 3  5.01 Fair      J     I1       65.5  59   18018 10.7  10.5   6.98
## 4  4.5  Fair      J     I1       65.8  58   18531 10.2  10.2   6.72
## 5  4.13 Fair      H     I1       64.8  61   17329 10     9.85  6.43
## 6  3.65 Fair      H     I1       67.1  53   11668  9.53  9.48  6.38

So far, only univariate outliers have been considered. However, to check the plausibility of those outliers, it is useful to consider how consistent their values are with the values of the other variables. In this case, scatter plots of each combination of x, y, and z show these outliers much more clearly.

ggplot(diamonds, aes(x = x, y = y)) +
  geom_point()

ggplot(diamonds, aes(x = x, y = z)) +
  geom_point()

ggplot(diamonds, aes(x = y, y = z)) +
  geom_point()
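
The same consistency check can be sketched numerically. The documentation for diamonds defines depth as the total depth percentage, 2 * z / (x + y), so a row whose recorded depth is far from the value implied by its own x, y, and z is suspect. The 1 percentage point tolerance and the column name implied_depth below are assumptions made for illustration, not part of the dataset or the original solution.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

tolerance <- 1  # hypothetical cutoff, in percentage points

inconsistent <- diamonds %>%
  # Skip the zero-valued rows discussed earlier; they are a separate problem.
  filter(x > 0, y > 0, z > 0) %>%
  # Depth percentage implied by the recorded dimensions.
  mutate(implied_depth = 100 * 2 * z / (x + y)) %>%
  # Keep rows where recorded and implied depth disagree by more than the cutoff.
  filter(abs(implied_depth - depth) > tolerance) %>%
  arrange(desc(abs(implied_depth - depth)))

inconsistent
```

The rows with the largest discrepancies should include the abnormally large values of y and z seen in the scatter plots. A tighter or looser tolerance changes how many borderline rows are flagged.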

Removing the outliers from x, y, and z makes the distributions easier to see. The right skewness of these distributions is unsurprising: there should be more small diamonds than large ones, and these values can never be negative. More interestingly, there are spikes in the distributions at certain values. These spikes often, but not exclusively, occur near integer values. Without knowing more about diamond cutting, it is difficult to say more about what these spikes represent. Perhaps some diamond sizes are used more often than others, and these spikes correspond to those sizes. It is also possible that the cut and carat of a diamond together imply its values of x, y, and z. Since there are spikes in the distribution of carat sizes, and only a few different cuts, that could result in these spikes.

filter(diamonds, x > 0, x < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)

filter(diamonds, y > 0, y < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)

filter(diamonds, z > 0, z < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)
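
To see where the spikes fall, it may help to tabulate the most frequent values instead of reading them off the histograms. The sketch below lists the ten most common values of x (after the same outlier filter as above) and of carat, so the two sets of spikes can be compared. The names top_x and top_carat are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Ten most frequent values of x, after dropping the outliers.
top_x <- diamonds %>%
  filter(x > 0, x < 10) %>%
  count(x, sort = TRUE) %>%
  head(10)

# Ten most frequent carat values, for comparison.
top_carat <- diamonds %>%
  count(carat, sort = TRUE) %>%
  head(10)

top_x
top_carat
```

This only shows which values are common. It does not establish that the carat spikes cause the spikes in x, y, and z.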

According to the documentation for diamonds, x is length, y is width, and z is depth. If the documentation were unavailable, we could compare the values of the variables to match them to length, width, and depth. I would expect length to always be greater than width; otherwise the length would be called the width. You could also search for the definitions of length, width, and depth with respect to diamond cuts. Depth can be expressed as a percentage of the length/width of the diamond, which suggests it should be less than both the length and the width.

summarize(diamonds, mean(x > y), mean(x > z), mean(y > z))
## # A tibble: 1 x 3
##   `mean(x > y)` `mean(x > z)` `mean(y > z)`
##           <dbl>         <dbl>         <dbl>
## 1         0.434          1.00          1.00

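It appears that depth (z) is almost always smaller than length (x) and width (y); the proportions display as 1.00 only because of rounding, and the diamond with z == 31.8 is one exception. This is perhaps because a shallower depth helps when setting diamonds in jewelry and because of how depth affects the reflection of light. Length is greater than width in less than half the observations (about 43%), the opposite of my expectations.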
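
Because the proportions in the summarize() output are rounded for display, raw counts give a clearer picture of these comparisons. The following sketch uses illustrative column names that are not part of the original solution.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

comparison <- diamonds %>%
  summarize(
    n_z_not_below_x = sum(z >= x),  # rows where depth is not smaller than length
    n_z_not_below_y = sum(z >= y),  # rows where depth is not smaller than width
    n_x_above_y = sum(x > y),
    n_x_equal_y = sum(x == y),
    n_x_below_y = sum(x < y)
  )

comparison
```

Splitting the x and y comparison into three counts also separates ties from the cases where width genuinely exceeds length, which mean(x > y) alone does not show.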