---
title: "EDA Covariation"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

```{r loadlibrary}
suppressPackageStartupMessages(library("tidyverse"))
suppressPackageStartupMessages(library("nycflights13"))
# These additional packages provide functions used in some of the answers
suppressPackageStartupMessages(library("ggstance"))
suppressPackageStartupMessages(library("lvplot"))
suppressPackageStartupMessages(library("ggbeeswarm"))
```

### 1. Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.

Instead of a frequency polygon (`geom_freqpoly()`), use a box plot.

```{r}
nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot() +
  geom_boxplot(mapping = aes(y = sched_dep_time, x = cancelled))
```
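As a numeric complement to the box plot, the following sketch summarises the scheduled departure time for each group. The names `sched_dep_hour` and `cancel_summary` are chosen here for illustration only; the conversion is the same one used in the plot (for example, a scheduled time of 1530 becomes 15 + 30/60 = 15.5 hours).

```{r}
library(dplyr)
library(nycflights13)

cancel_summary <- flights %>%
  mutate(
    cancelled = is.na(dep_time),
    # HHMM integer -> decimal hours, e.g. 1530 -> 15.5
    sched_dep_hour = sched_dep_time %/% 100 + (sched_dep_time %% 100) / 60
  ) %>%
  group_by(cancelled) %>%
  summarise(
    n = n(),
    q25 = quantile(sched_dep_hour, 0.25, na.rm = TRUE),
    median = median(sched_dep_hour, na.rm = TRUE),
    q75 = quantile(sched_dep_hour, 0.75, na.rm = TRUE)
  )
cancel_summary
```

The quartiles in this table correspond to the box edges and centre line of each box in the plot.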

### 2. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

What are the general relationships of each variable with the price of the diamonds? I will consider the variables: `carat`, `clarity`, `color`, and `cut`. I ignore the dimensions of the diamond since `carat` measures size, and thus incorporates most of the information contained in these variables.

Since both `price` and `carat` are continuous variables, I use a scatter plot to visualize their relationship.

```{r}
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()
```

However, since there is a large number of points in the data, I will use a boxplot by binning `carat` (as suggested in the chapter).

```{r}
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
```

The choice of bin width matters. If the bins are too wide, they obscure the relationship; if they are too narrow, each bin contains too few values, and the resulting noise hides the underlying trend.
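To see why the width matters, the sketch below counts how many diamonds fall in each 0.1-carat bin and then builds an alternative plot with `cut_number()`, which puts roughly the same number of observations in every bin. The object names `bin_sizes` and `p_equal_n` and the choice of 10 bins are assumptions made for this example.

```{r}
library(dplyr)
library(ggplot2)

# Observations per 0.1-carat bin: bins at high carat hold very few diamonds
bin_sizes <- diamonds %>%
  count(bin = cut_width(carat, 0.1))
range(bin_sizes$n)

# Alternative: 10 bins with approximately equal numbers of diamonds
p_equal_n <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_number(carat, 10)))
p_equal_n
```

With equal-count bins the boxes are all estimated from similar amounts of data, at the cost of bins that vary in width.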

The variables `color` and `clarity` are ordered categorical variables. The chapter suggests visualizing a categorical and continuous variable using frequency polygons or boxplots. In this case, I will use a box plot since it will better show a relationship between the variables.

The scale of diamond color goes from D (best) to J (worst), so the levels of `diamonds$color`, which is a factor, run from best to worst. Before plotting, I reverse the order of the `color` levels so that quality increases along the x-axis. The plot shows a weak negative relationship between `color` and `price`.

```{r}
diamonds %>%
  mutate(color = fct_rev(color)) %>%
  ggplot(aes(x = color, y = price)) +
  geom_boxplot()
```
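A quick check of what `fct_rev()` does to the factor levels; the object names `original_levels` and `reversed_levels` are introduced only for this illustration.

```{r}
library(ggplot2)  # provides the diamonds data
library(forcats)

original_levels <- levels(diamonds$color)
reversed_levels <- levels(fct_rev(diamonds$color))

original_levels  # best (D) first
reversed_levels  # worst (J) first, so quality increases left to right
```

Only the order of the levels changes; the values stored in the column are untouched.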

There is also a weak negative relationship between `clarity` and `price`. The scale of clarity goes from I1 (worst) to IF (best).

```{r}
ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = clarity, y = price))
```

For both `clarity` and `color`, there is a much larger amount of variation within each category than between categories. Carat is clearly the single best predictor of diamond prices.
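One rough way to back up this visual impression is to compare the share of price variance explained by each variable on its own. The sketch below fits four single-predictor linear models; this is an illustrative check rather than part of the original analysis, and a linear R² is only a crude measure for relationships that are not linear.

```{r}
library(ggplot2)  # provides the diamonds data

predictors <- c("carat", "cut", "color", "clarity")

# R-squared of price regressed on each predictor separately
r_squared <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "price"), data = diamonds)
  summary(fit)$r.squared
})
sort(r_squared, decreasing = TRUE)
```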

Now that we have established that carat appears to be the best predictor of price, what is the relationship between it and cut? Since this is an example of a continuous (carat) and categorical (cut) variable, it can be visualized with a box plot.

```{r}
ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_boxplot()
```

There is a lot of variability in the distribution of carat sizes within each cut category. There is a slight negative relationship between carat and cut. Notably, the largest diamonds by carat have a cut of “Fair” (the lowest).

This negative relationship can be due to the way in which diamonds are selected for sale. A larger diamond can be profitably sold with a lower quality cut, while a smaller diamond requires a better cut.
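The combination of the two relationships can be checked with a small summary table. The sketch below (the name `cut_summary` is chosen here for illustration) reports the median carat and median price for each cut, so the two medians can be read side by side.

```{r}
library(dplyr)
library(ggplot2)  # provides the diamonds data

cut_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n = n(),
    median_carat = median(carat),
    median_price = median(price)
  )
cut_summary
```

If the lower-quality cuts have larger median carat, their higher prices are explained by size rather than by cut.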

### 3. Install the ggstance package, and create a horizontal box plot. How does this compare to using `coord_flip()`?

Earlier, we created this horizontal box plot of the distribution of `hwy` by `class`, using `geom_boxplot()` and `coord_flip()`:

```{r}
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()
```

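With `geom_boxploth()` from ggstance, the output looks the same, but the `x` and `y` aesthetics are swapped: the continuous variable is mapped to `x` and the categorical variable to `y`.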

```{r}
# geom_boxploth() is provided by the ggstance package
ggplot(data = mpg) +
  geom_boxploth(mapping = aes(y = reorder(class, hwy, FUN = median), x = hwy))
```
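A third option, assuming ggplot2 version 3.3.0 or later is installed, is to let `geom_boxplot()` infer the orientation from the aesthetics. The sketch below maps the continuous variable to `x` directly, without `coord_flip()` or ggstance; the object name `p_horizontal` is used only in this example.

```{r}
library(ggplot2)

# Assumes ggplot2 >= 3.3.0: a continuous x with a discrete y
# produces horizontal boxes without coord_flip()
p_horizontal <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = hwy, y = reorder(class, hwy, FUN = median)))
p_horizontal
```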

### 4. One problem with box plots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the [lvplot](https://www.rdocumentation.org/packages/lvplot/versions/0.2.0) package, and try using `geom_lv()` to display the distribution of price vs cut. What do you learn?

How do you interpret the plots?

Like box plots, the boxes of a letter-value plot correspond to quantiles, but a letter-value plot shows many more quantiles than the quartiles of a box plot. Letter-value plots are useful for larger datasets because:

1. larger datasets can give precise estimates of quantiles beyond the quartiles, and
2. in expectation, larger datasets should have more outliers (in absolute numbers).

```{r}
# geom_lv() is provided by the lvplot package
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_lv()
```
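To make the link between boxes and quantiles concrete, the sketch below approximates the first few letter values for one cut using `quantile()`. This is an illustration under simplifying assumptions, not the exact computation used by `geom_lv()`: letter values are formally defined by depths, and lvplot also decides how many boxes to draw. The names `tail_probs` and `letter_values` are introduced here.

```{r}
library(ggplot2)  # provides the diamonds data

ideal_price <- diamonds$price[diamonds$cut == "Ideal"]

# Tail areas 1/2, 1/4, 1/8, 1/16, 1/32: median, fourths, eighths, ...
tail_probs <- 0.5^(1:5)

letter_values <- data.frame(
  letter = c("M", "F", "E", "D", "C"),
  tail_area = tail_probs,
  lower = unname(quantile(ideal_price, tail_probs)),
  upper = unname(quantile(ideal_price, 1 - tail_probs))
)
letter_values
```

Each row corresponds to one box in the letter-value plot: the `F` row is the ordinary box-plot box, and each further row extends half as far again into the tails.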

The letter-value plot is described in Hofmann, Wickham, and Kafadar ([2017](https://doi.org/10.1080/10618600.2017.1305277)).

### 5. Compare and contrast `geom_violin()` with a faceted `geom_histogram()`, or a colored `geom_freqpoly()`. What are the pros and cons of each method?

I produce plots for these three methods below. The `geom_freqpoly()` plot is better for look-up: given a price, it is easy to tell which `cut` has the highest density. However, the overlapping lines make it difficult to distinguish how the overall distributions relate to each other. The `geom_violin()` and faceted `geom_histogram()` plots have similar strengths and weaknesses. It is easy to visually distinguish differences in the overall shape of the distributions (skewness, central values, variance, etc.). However, since we can’t easily compare the vertical values of the distributions, it is difficult to look up which category has the highest density for a given price. All of these methods depend on tuning parameters that determine the smoothness of the displayed distribution.

```{r}
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500)

ggplot(data = diamonds, mapping = aes(x = price)) +
  # set the bin width explicitly, matching the frequency polygon above
  geom_histogram(binwidth = 500) +
  facet_wrap(~cut, ncol = 1, scales = "free_y")

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_violin() +
  coord_flip()
```
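The tuning parameter for the frequency polygon and histogram is `binwidth`; for the violin plot it is the bandwidth of the kernel density estimate. The sketch below shows one way to vary it through the `adjust` multiplier; the values 0.25 and 4 and the object names are arbitrary choices for illustration.

```{r}
library(ggplot2)

# A small multiplier follows the data closely and gives a bumpy outline
p_violin_rough <- ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_violin(adjust = 0.25) +
  coord_flip()

# A large multiplier smooths away most of the detail
p_violin_smooth <- ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_violin(adjust = 4) +
  coord_flip()

p_violin_rough
p_violin_smooth
```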

The violin plot was first described in Hintze and Nelson ([1998](https://doi.org/10.1080/00031305.1998.10480559)).

### 6. If you have a small dataset, it’s sometimes useful to use `geom_jitter()` to see the relationship between a continuous and categorical variable. The [ggbeeswarm](https://www.rdocumentation.org/packages/ggbeeswarm/versions/0.5.3) package provides a number of methods similar to `geom_jitter()`. List them and briefly describe what each one does.

There are two methods:

 - `geom_quasirandom()` produces plots that are a mix of jitter and violin plots. There are several different methods that determine exactly how the random location of the points is generated.
 - `geom_beeswarm()` produces a plot similar to a violin plot, but by offsetting the points.

I’ll reuse the `mpg` box plot example. Since these methods display individual points, they are better suited to smaller datasets.

```{r}
ggplot(data = mpg) +
  geom_quasirandom(mapping = aes(
    x = reorder(class, hwy, FUN = median),
    y = hwy
  ))

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "tukey"
  )

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "tukeyDense"
  )

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "frowney"
  )

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "smiley"
  )

ggplot(data = mpg) +
  geom_beeswarm(mapping = aes(
    x = reorder(class, hwy, FUN = median),
    y = hwy
  ))
```
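For comparison, here is the plain `geom_jitter()` baseline on the same data. The jitter `width` of 0.2 is an arbitrary choice for this sketch, and `height = 0` keeps the `hwy` values exact so that only the horizontal position is randomised.

```{r}
library(ggplot2)

p_jitter <- ggplot(data = mpg) +
  geom_jitter(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    width = 0.2,  # horizontal spread within each class
    height = 0    # do not perturb the hwy values
  )
p_jitter
```

Unlike the ggbeeswarm geoms, the horizontal offsets here do not depend on how many points share a similar `hwy` value, so the shape of each distribution is harder to read.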