suppressPackageStartupMessages(library("tidyverse"))
package 㤼㸱tidyverse㤼㸲 was built under R version 3.6.3

1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

First, I’ll calculate summary statistics for these variables and plot their distributions.

summary(select(diamonds, x, y, z))
       x                y                z         
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
 Median : 5.700   Median : 5.710   Median : 3.530  
 Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
 3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
 Max.   :10.740   Max.   :58.900   Max.   :31.800  
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01)


ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01)


ggplot(diamonds) +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01)

There several noticeable features of the distributions:

  1. x and y are larger than z,
  2. there are outliers,
  3. they are all right skewed, and
  4. they are multimodal or “spiky”.

The typical values of x and y are larger than z, with x and y having inter-quartile ranges of 4.7-5.7, while z has an inter-quartile range of 2.9-4.0.

There are two types of outliers in this data. Some diamonds have values of zero and some have abnormally large values of x, y, or z.

summary(select(diamonds, x, y, z))
       x                y                z         
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
 Median : 5.700   Median : 5.710   Median : 3.530  
 Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
 3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
 Max.   :10.740   Max.   :58.900   Max.   :31.800  

These appear to be either data entry errors, or an undocumented convention in the dataset for indicating missing values. An alternative hypothesis would be that values of zero are the result of rounding values like 0.002 down, but since there are no diamonds with values of 0.01, that does not seem to be the case.

filter(diamonds, x == 0 | y == 0 | z == 0)

There are also some diamonds with values of y and z that are abnormally large. There are diamonds with y == 58.9 and y == 31.8, and one with z == 31.8. These are probably data errors since the values do not seem in line with the values of the other variables.

diamonds %>%
  arrange(desc(y)) %>%
  head()

diamonds %>%
  arrange(desc(z)) %>%
  head()

Initially, I only considered univariate outliers. However, to check the plausibility of those outliers I would informally consider how consistent their values are with the values of the other variables. In this case, scatter plots of each combination of x, y, and z shows these outliers much more clearly.

ggplot(diamonds, aes(x = x, y = y)) +
  geom_point()


ggplot(diamonds, aes(x = x, y = z)) +
  geom_point()


ggplot(diamonds, aes(x = y, y = z)) +
  geom_point()

Removing the outliers from x, y, and z makes the distribution easier to see. The right skewness of these distributions is unsurprising; there should be more smaller diamonds than larger ones and these values can never be negative. More interestingly, there are spikes in the distribution at certain values. These spikes often, but not exclusively, occur near integer values. Without knowing more about diamond cutting, I can’t say more about what these spikes represent. If you know, add a comment. I would guess that some diamond sizes are used more often than others, and these spikes correspond to those sizes. Also, I would guess that a diamond cut and carat value of a diamond imply values of x, y, and z. Since there are spikes in the distribution of carat sizes, and only a few different cuts, that could result in these spikes. I’ll leave it to you to figure out if that’s the case.

filter(diamonds, x > 0, x < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)


filter(diamonds, y > 0, y < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)


filter(diamonds, z > 0, z < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)

According to the documentation for diamonds, x is length, y is width, and z is depth. If documentation were unavailable, I would compare the values of the variables to match them to the length, width, and depth. I would expect length to always be less than width, otherwise the length would be called the width. I would also search for the definitions of length, width, and depth with respect to diamond cuts. Depth can be expressed as a percentage of the length/width of the diamond, which means it should be less than both the length and the width.

summarise(diamonds, mean(x > y), mean(x > z), mean(y > z))

It appears that depth (z) is always smaller than length (x) or width (y), perhaps because a shallower depth helps when setting diamonds in jewelry and due to how it affect the reflection of light. Length is more than width in less than half the observations, the opposite of my expectations.

2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

  • The price data has many spikes, but I can’t tell what each spike corresponds to. The following plots don’t show much difference in the distributions in the last one or two digits.
  • There are no diamonds with a price of $1,500 (between $1,450 and $1,550).
  • There’s a bulge in the distribution around $750.
ggplot(filter(diamonds, price < 2500), aes(x = price)) +
  geom_histogram(binwidth = 10, center = 0)


ggplot(filter(diamonds), aes(x = price)) +
  geom_histogram(binwidth = 100, center = 0)

Distribution of last digit

diamonds %>%
  mutate(ending = price %% 10) %>%
  ggplot(aes(x = ending)) +
  geom_histogram(binwidth = 1, center = 0)


diamonds %>%
  mutate(ending = price %% 100) %>%
  ggplot(aes(x = ending)) +
  geom_histogram(binwidth = 1)


diamonds %>%
  mutate(ending = price %% 1000) %>%
  filter(ending >= 500, ending <= 800) %>%
  ggplot(aes(x = ending)) +
  geom_histogram(binwidth = 1)

3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

There are more than 70 times as many 1 carat diamonds as 0.99 carat diamond.

diamonds %>%
  filter(carat >= 0.99, carat <= 1) %>%
  count(carat)

I don’t know exactly the process behind how carats are measured, but some way or another some diamonds carat values are being “rounded up” Presumably there is a premium for a 1 carat diamond vs. a 0.99 carat diamond beyond the expected increase in price due to a 0.01 carat increase.

To check this intuition, we would want to look at the number of diamonds in each carat range to see if there is an unusually low number of 0.99 carat diamonds, and an abnormally large number of 1 carat diamonds.

diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  count(carat) %>%
  print(n = Inf)

4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

The coord_cartesian() function zooms in on the area specified by the limits, after having calculated and drawn the geoms. Since the histogram bins have already been calculated, it is unaffected.

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  coord_cartesian(xlim = c(100, 5000), ylim = c(0, 3000))

However, the xlim() and ylim() functions influence actions before the calculation of the stats related to the histogram. Thus, any values outside the x- and y-limits are dropped before calculating bin widths and counts. This can influence how the histogram looks.

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  xlim(100, 5000) +
  ylim(0, 3000)

---
title: "Exploratory Data Analysis: Variation"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

```{r loadlibrary}
suppressPackageStartupMessages(library("tidyverse"))
```

### 1. Explore the distribution of each of the `x`, `y`, and `z` variables in `diamonds`. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

First, I’ll calculate summary statistics for these variables and plot their distributions.

```{r}
summary(select(diamonds, x, y, z))

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01)

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01)

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01)
```

There several noticeable features of the distributions:

1. `x` and `y` are larger than `z`,
2. there are outliers,
3. they are all right skewed, and
4. they are multimodal or “spiky”.

The typical values of `x` and `y` are larger than `z`, with `x` and `y` having inter-quartile ranges of 4.7-5.7, while `z` has an inter-quartile range of 2.9-4.0.

There are two types of outliers in this data. Some diamonds have values of zero and some have abnormally large values of `x`, `y`, or `z`.

```{r}
summary(select(diamonds, x, y, z))
```

These appear to be either data entry errors, or an undocumented convention in the dataset for indicating missing values. An alternative hypothesis would be that values of zero are the result of rounding values like `0.002` down, but since there are no diamonds with values of 0.01, that does not seem to be the case.

```{r}
filter(diamonds, x == 0 | y == 0 | z == 0)
```

There are also some diamonds with values of `y` and `z` that are abnormally large. There are diamonds with `y == 58.9` and `y == 31.8`, and one with `z == 31.8`. These are probably data errors since the values do not seem in line with the values of the other variables.

```{r}
diamonds %>%
  arrange(desc(y)) %>%
  head()

diamonds %>%
  arrange(desc(z)) %>%
  head()
```

Initially, I only considered univariate outliers. However, to check the plausibility of those outliers I would informally consider how consistent their values are with the values of the other variables. In this case, scatter plots of each combination of `x`, `y`, and `z` shows these outliers much more clearly.

```{r}
ggplot(diamonds, aes(x = x, y = y)) +
  geom_point()

ggplot(diamonds, aes(x = x, y = z)) +
  geom_point()

ggplot(diamonds, aes(x = y, y = z)) +
  geom_point()
```

Removing the outliers from `x`, `y`, and `z` makes the distribution easier to see. The right skewness of these distributions is unsurprising; there should be more smaller diamonds than larger ones and these values can never be negative. More interestingly, there are spikes in the distribution at certain values. These spikes often, but not exclusively, occur near integer values. Without knowing more about diamond cutting, I can’t say more about what these spikes represent. If you know, add a comment. I would guess that some diamond sizes are used more often than others, and these spikes correspond to those sizes. Also, I would guess that a diamond cut and carat value of a diamond imply values of `x`, `y`, and `z`. Since there are spikes in the distribution of carat sizes, and only a few different cuts, that could result in these spikes. I’ll leave it to you to figure out if that’s the case.

```{r}
filter(diamonds, x > 0, x < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)

filter(diamonds, y > 0, y < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)

filter(diamonds, z > 0, z < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)
```

According to the documentation for [diamonds](https://ggplot2.tidyverse.org/reference/diamonds.html), `x` is length, `y` is width, and `z` is depth. If documentation were unavailable, I would compare the values of the variables to match them to the length, width, and depth. I would expect length to always be less than width, otherwise the length would be called the width. I would also search for the definitions of length, width, and depth with respect to diamond cuts. [Depth](https://en.wikipedia.org/wiki/Diamond_cut) can be expressed as a percentage of the length/width of the diamond, which means it should be less than both the length and the width.

```{r}
summarise(diamonds, mean(x > y), mean(x > z), mean(y > z))
```

It appears that depth (`z`) is always smaller than length (`x`) or width (`y`), perhaps because a shallower depth helps when setting diamonds in jewelry and due to how it affect the reflection of light. Length is more than width in less than half the observations, the opposite of my expectations.

### 2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the `binwidth` and make sure you try a wide range of values.)

 - The price data has many spikes, but I can’t tell what each spike corresponds to. The following plots don’t show much difference in the distributions in the last one or two digits.
 - There are no diamonds with a price of $1,500 (between $1,450 and $1,550).
 - There’s a bulge in the distribution around $750.

```{r}
ggplot(filter(diamonds, price < 2500), aes(x = price)) +
  geom_histogram(binwidth = 10, center = 0)

ggplot(filter(diamonds), aes(x = price)) +
  geom_histogram(binwidth = 100, center = 0)
```

Distribution of last digit

```{r}
diamonds %>%
  mutate(ending = price %% 10) %>%
  ggplot(aes(x = ending)) +
  geom_histogram(binwidth = 1, center = 0)

diamonds %>%
  mutate(ending = price %% 100) %>%
  ggplot(aes(x = ending)) +
  geom_histogram(binwidth = 1)

diamonds %>%
  mutate(ending = price %% 1000) %>%
  filter(ending >= 500, ending <= 800) %>%
  ggplot(aes(x = ending)) +
  geom_histogram(binwidth = 1)
```

### 3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

There are more than 70 times as many 1 carat diamonds as 0.99 carat diamond.

```{r}
diamonds %>%
  filter(carat >= 0.99, carat <= 1) %>%
  count(carat)
```

I don’t know exactly the process behind how carats are measured, but some way or another some diamonds carat values are being “rounded up” Presumably there is a premium for a 1 carat diamond vs. a 0.99 carat diamond beyond the expected increase in price due to a 0.01 carat increase.

To check this intuition, we would want to look at the number of diamonds in each carat range to see if there is an unusually low number of 0.99 carat diamonds, and an abnormally large number of 1 carat diamonds.

```{r}
diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  count(carat) %>%
  print(n = Inf)
```


### 4. Compare and contrast `coord_cartesian()` vs `xlim()` or `ylim()` when zooming in on a histogram. What happens if you leave `binwidth` unset? What happens if you try and zoom so only half a bar shows?

The `coord_cartesian()` function zooms in on the area specified by the limits, after having calculated and drawn the geoms. Since the histogram bins have already been calculated, it is unaffected.

```{r}
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  coord_cartesian(xlim = c(100, 5000), ylim = c(0, 3000))
```

However, the `xlim()` and `ylim()` functions influence actions before the calculation of the stats related to the histogram. Thus, any values outside the x- and y-limits are dropped before calculating bin widths and counts. This can influence how the histogram looks.

```{r}
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  xlim(100, 5000) +
  ylim(0, 3000)
```

