suppressPackageStartupMessages(library("tidyverse"))
package 'tidyverse' was built under R version 3.6.3
suppressPackageStartupMessages(library("viridis"))
package 'viridis' was built under R version 3.6.2

1. Instead of summarizing the conditional distribution with a box plot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualization of the 2d distribution of carat and price?

Both cut_width() and cut_number() split a variable into groups. When using cut_width(), we need to choose the width, and the number of bins will be calculated automatically. When using cut_number(), we need to specify the number of bins, and the widths will be calculated automatically.

In either case, we want bins that are wide enough (equivalently, few enough) to aggregate observations and smooth out noise, but not so wide that they remove all the signal.
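To make the difference between the two functions concrete, here is a minimal sketch that tabulates the group sizes each one produces for carat. The bin width of 1 and the five groups are only illustrative choices, and the object names by_width and by_number are hypothetical.

```r
library(ggplot2)  # provides cut_width(), cut_number(), and the diamonds data

# Equal-width bins: we choose the width, so the counts per bin can differ a lot.
by_width <- table(cut_width(diamonds$carat, width = 1, boundary = 0))

# Equal-count bins: we choose the number of bins, so the widths differ instead.
by_number <- table(cut_number(diamonds$carat, n = 5))

# Compare the smallest and largest group sizes under each scheme.
range(by_width)
range(by_number)
```

With cut_width() the groups can be very unequal in size, which matters for a frequency polygon of counts because small groups are hard to see next to large ones.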

When groups are mapped to categorical colors, no more than about eight colors should be used so that they remain distinct. Using cut_number(), I will split carat into quintiles (five groups with roughly equal numbers of observations).

ggplot(
  data = diamonds,
  mapping = aes(color = cut_number(carat, 5), x = price)
) +
  geom_freqpoly() +
  labs(x = "Price", y = "Count", color = "Carat")

Alternatively, I could use cut_width() to specify the width of each bin. I will choose 1-carat widths. Since there are very few diamonds larger than 2 carats, this is not as informative. However, using a width of 0.5 carats creates too many groups, and splitting at non-whole numbers is unappealing.

ggplot(
  data = diamonds,
  mapping = aes(color = cut_width(carat, 1, boundary = 0), x = price)
) +
  geom_freqpoly() +
  labs(x = "Price", y = "Count", color = "Carat")
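The two claims behind the choice of width can be checked directly. The sketch below computes the share of diamonds heavier than 2 carats and the number of groups produced by 1-carat and 0.5-carat widths; the variable names are hypothetical.

```r
library(ggplot2)

# Share of diamonds heavier than 2 carats.
share_over_2 <- mean(diamonds$carat > 2)

# Number of color groups produced by each candidate width.
n_groups_1 <- nlevels(cut_width(diamonds$carat, 1, boundary = 0))
n_groups_half <- nlevels(cut_width(diamonds$carat, 0.5, boundary = 0))

share_over_2
c(width_1 = n_groups_1, width_0.5 = n_groups_half)
```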

2. Visualize the distribution of carat, partitioned by price.

First, a box plot with 10 bins that each contain roughly the same number of observations. The price range covered by each bin is therefore determined by the data.

ggplot(diamonds, aes(x = cut_number(price, 10), y = carat)) +
  geom_boxplot() +
  coord_flip() +
  xlab("Price")

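Second, a box plot with 10 equal-width bins of $2,000. The argument boundary = 0 ensures that the first bin is $0–$2,000, and varwidth = TRUE makes the width of each box proportional to the square root of the number of observations in its bin.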

ggplot(diamonds, aes(x = cut_width(price, 2000, boundary = 0), y = carat)) +
  geom_boxplot(varwidth = TRUE) +
  coord_flip() +
  xlab("Price")
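As a rough check on the equal-width version, the sketch below counts the observations in each $2,000 bin and computes the relative box widths implied by scaling with the square root of the group size, which is how varwidth is documented to behave. The names price_bins, bin_counts, and relative_width are hypothetical.

```r
library(ggplot2)

# The same binning used in the plot above.
price_bins <- cut_width(diamonds$price, 2000, boundary = 0)
bin_counts <- table(price_bins)

# Box widths are proportional to sqrt(n); rescale so the widest box is 1.
relative_width <- sqrt(bin_counts) / max(sqrt(bin_counts))

bin_counts
round(relative_width, 2)
```

Because most diamonds are inexpensive, the lowest price bin holds the most observations and gets the widest box.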

3. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

The price distribution of very large diamonds is more variable than that of small diamonds. I had no strong expectation beforehand, since I knew little about diamond prices, but after the fact it does not seem surprising (as is true of many things).

I would guess that this is due to the way in which diamonds are selected for retail sale. Suppose that someone selling a diamond only finds it profitable to sell if some combination of size, cut, clarity, and color is above a certain threshold. The smallest diamonds are only profitable to sell if they are exceptional in all the other factors (cut, clarity, and color), so the small diamonds sold have similar characteristics. Larger diamonds, however, may be profitable regardless of the values of the other factors. Thus we will observe large diamonds with a wider variety of cut, clarity, and color, and therefore more variability in prices.
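The claim about variability can be checked numerically. This sketch computes the interquartile range of price within carat quintiles; the use of quintiles and of the IQR as the measure of spread are my own assumptions, not the only reasonable choices.

```r
library(ggplot2)

# Group diamonds into five carat groups of roughly equal size.
carat_group <- cut_number(diamonds$carat, 5)

# Spread of price within each group, measured by the interquartile range.
price_iqr <- tapply(diamonds$price, carat_group, IQR)

price_iqr
```

The group of largest diamonds has a much wider price IQR than the group of smallest diamonds.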

4. Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.

There are many options to try, so your solutions may vary from mine. Here are a few options that I tried.

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex() +
  facet_wrap(~cut, ncol = 1) +
  scale_fill_viridis()


ggplot(diamonds, aes(x = cut_number(carat, 5), y = price, colour = cut)) +
  geom_boxplot()


ggplot(diamonds, aes(colour = cut_number(carat, 5), y = price, x = cut)) +
  geom_boxplot()

5. Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

Why is a scatterplot a better display than a binned plot for this case?

In this case, there is a strong relationship between \(x\) and \(y\). The outliers are not extreme in either \(x\) or \(y\) alone; they are unusual only in combination. A binned plot summarizes each bin rather than showing individual points, so it would not reveal these outliers, and it may lead us to conclude that the largest value of \(x\) was an outlier even though it appears to fit the bivariate pattern well.
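One rough way to list such points, rather than only seeing them, is to flag observations that fall far from the line y = x. This is a sketch under two assumptions of mine: that x and y are nearly equal for typical diamonds, as the scatterplot suggests, and that 0.5 is a sensible cutoff. The cutoff is arbitrary, and the names in_window and flagged are hypothetical.

```r
library(ggplot2)

# Restrict to the window shown in the scatterplot above.
in_window <- subset(diamonds, x >= 4 & x <= 11 & y >= 4 & y <= 11)

# Flag points whose x and y differ by more than an arbitrary cutoff of 0.5.
flagged <- subset(in_window, abs(x - y) > 0.5)

nrow(flagged)
flagged[, c("carat", "x", "y")]
```

Every flagged point lies inside the plotted range of both variables, so neither a one-dimensional summary of x nor one of y would single it out.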
