1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
First, I’ll calculate summary statistics for these variables and plot their distributions.
summary(select(diamonds, x, y, z))
       x                y                z
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000
 1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910
 Median : 5.700   Median : 5.710   Median : 3.530
 Mean   : 5.731   Mean   : 5.735   Mean   : 3.539
 3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040
 Max.   :10.740   Max.   :58.900   Max.   :31.800
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01)

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01)

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01)

There are several noticeable features of the distributions:
- x and y are larger than z,
- there are outliers,
- they are all right skewed, and
- they are multimodal or “spiky”.
The typical values of x and y are larger than those of z: x and y have inter-quartile ranges of about 4.7-6.5, while z has an inter-quartile range of 2.9-4.0.
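The quartiles can also be computed directly rather than read off the summary() output. The following is a minimal sketch, assuming dplyr 1.0 or later for across(); the names quartiles, q1, and q3 are illustrative and not part of the original solution.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# First and third quartiles of each dimension (all recorded in mm).
# names = FALSE drops the "25%" / "75%" labels that quantile() attaches.
quartiles <- diamonds %>%
  summarise(across(
    c(x, y, z),
    list(
      q1 = ~ quantile(.x, 0.25, names = FALSE),
      q3 = ~ quantile(.x, 0.75, names = FALSE)
    )
  ))

# Columns are named x_q1, x_q3, y_q1, y_q3, z_q1, z_q3.
quartiles
```

The resulting one-row tibble should agree with the 1st Qu. and 3rd Qu. rows of the summary table.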
There are two types of outliers in this data. Some diamonds have values of zero and some have abnormally large values of x, y, or z.
The zeros appear to be either data entry errors or an undocumented convention in the dataset for indicating missing values. An alternative hypothesis would be that values of zero are the result of rounding values like 0.002 down, but since there are no diamonds with values of 0.01, that does not seem to be the case.
filter(diamonds, x == 0 | y == 0 | z == 0)
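One way to probe the rounding hypothesis is to count the zeros in each dimension and then look at the smallest strictly positive value. This is a rough sketch, assuming dplyr 1.0 or later; zero_counts and min_positive are illustrative names. If the zeros were rounded-down measurements, values just above zero would be expected as well.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# Number of diamonds recording a zero in each dimension.
zero_counts <- diamonds %>%
  summarise(across(c(x, y, z), ~ sum(.x == 0)))

# Smallest strictly positive value in each dimension. A large gap between
# zero and this value argues against the zeros being rounded measurements.
min_positive <- diamonds %>%
  summarise(across(c(x, y, z), ~ min(.x[.x > 0])))

zero_counts
min_positive
```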
There are also some diamonds with values of y and z that are abnormally large. There are diamonds with y == 58.9 and y == 31.8, and one with z == 31.8. These are probably data errors since the values do not seem in line with the values of the other variables.
diamonds %>%
  arrange(desc(y)) %>%
  head()

diamonds %>%
  arrange(desc(z)) %>%
  head()
Initially, I only considered univariate outliers. However, to check the plausibility of those outliers I would informally consider how consistent their values are with the values of the other variables. In this case, scatter plots of each combination of x, y, and z show these outliers much more clearly.
ggplot(diamonds, aes(x = x, y = y)) +
  geom_point()

ggplot(diamonds, aes(x = x, y = z)) +
  geom_point()

ggplot(diamonds, aes(x = y, y = z)) +
  geom_point()
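The consistency check can be made less informal by using the depth column. The documentation for diamonds defines depth as the total depth percentage, 2 * z / (x + y). The sketch below recomputes that percentage from x, y, and z and flags rows where it disagrees with the recorded depth. The cutoff of 1 percentage point and the names implied_depth, depth_gap, and suspect are assumptions made for illustration, not part of the original solution.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

suspect <- diamonds %>%
  # Drop rows with a zero dimension so the ratio below is well defined.
  filter(x > 0, y > 0, z > 0) %>%
  mutate(
    # Depth percentage implied by the recorded dimensions.
    implied_depth = 100 * 2 * z / (x + y),
    # Absolute disagreement with the recorded depth, in percentage points.
    depth_gap = abs(implied_depth - depth)
  ) %>%
  # Hypothetical threshold: more than 1 percentage point of disagreement.
  filter(depth_gap > 1) %>%
  arrange(desc(depth_gap))

select(suspect, carat, depth, implied_depth, x, y, z)
```

Rows at the top of suspect are the ones whose dimensions are least consistent with their recorded depth, which makes them good candidates for closer inspection.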

Removing the outliers from x, y, and z makes the distributions easier to see. The right skewness of these distributions is unsurprising; there should be more smaller diamonds than larger ones, and these values can never be negative. More interestingly, there are spikes in the distributions at certain values. These spikes often, but not exclusively, occur near integer values. Without knowing more about diamond cutting, I can’t say more about what these spikes represent. If you know, add a comment. I would guess that some diamond sizes are used more often than others, and these spikes correspond to those sizes. I would also guess that the cut and carat of a diamond largely determine its values of x, y, and z. Since there are spikes in the distribution of carat sizes, and only a few different cuts, that could result in these spikes. I’ll leave it to you to figure out if that’s the case.
filter(diamonds, x > 0, x < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)

filter(diamonds, y > 0, y < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)

filter(diamonds, z > 0, z < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)
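A possible starting point for checking the carat explanation is to find the most common carat values and see where their diamonds fall in x. This is only a sketch of one approach; common_carats and spike_candidates are illustrative names, and the choice of the top five carat values is arbitrary.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# The five most frequent carat values.
common_carats <- diamonds %>%
  count(carat, sort = TRUE) %>%
  head(5)

# Typical length, and its spread, for diamonds at those carat values.
# If carat largely determines size, the spikes in the histogram of x
# should sit near these medians and the spread should be small.
spike_candidates <- diamonds %>%
  filter(x > 0) %>%
  semi_join(common_carats, by = "carat") %>%
  group_by(carat) %>%
  summarise(n = n(), median_x = median(x), iqr_x = IQR(x)) %>%
  arrange(desc(n))

spike_candidates
```

Comparing median_x against the locations of the spikes in the histogram of x would be the next step, and the same summary could be repeated with cut added to the grouping.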

According to the documentation for diamonds, x is length, y is width, and z is depth. If documentation were unavailable, I would compare the values of the variables to match them to the length, width, and depth. I would expect length to always be greater than width, otherwise the length would be called the width. I would also search for the definitions of length, width, and depth with respect to diamond cuts. Depth can be expressed as a percentage of the length/width of the diamond, which means it should be less than both the length and the width.
summarise(diamonds, mean(x > y), mean(x > z), mean(y > z))
It appears that depth (z) is always smaller than length (x) or width (y), perhaps because a shallower depth helps when setting diamonds in jewelry and because of how it affects the reflection of light. Length is more than width in less than half the observations, the opposite of my expectations.
