---
title: "Statistical Transforms Demo"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

### 1. What is the default geom associated with `stat_summary()`? How could you rewrite the previous plot to use that geom function instead of the stat function?

The “previous plot” referred to in the question is the following.
```{r previous}
suppressPackageStartupMessages(library(tidyverse))
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )
```

The default geom for [stat_summary()](https://ggplot2.tidyverse.org/reference/stat_summary.html) is `geom_pointrange()`. The default stat for [geom_pointrange()](https://ggplot2.tidyverse.org/reference/geom_linerange.html) is `stat_identity()`, but we can add the argument `stat = "summary"` to use `stat_summary()` instead.

```{r nosummary}
ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary"
  )
```

The resulting message says that no summary function was supplied, so `stat_summary()` defaults to `mean_se()`, which uses the mean for the middle point and the mean plus or minus one standard error for the endpoints of the line. However, in the original plot the median was used for the middle point and the min and max values were used for the endpoints. To recreate the original plot we need to specify values for `fun.ymin`, `fun.ymax`, and `fun.y`.

```{r newvalues}
ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )
```
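
In ggplot2 3.3.0 and later, the summary arguments are named `fun`, `fun.min`, and `fun.max`, and the older `fun.y`, `fun.ymin`, and `fun.ymax` are deprecated. The following is a minimal sketch of the same plot under that assumption about the installed version. The object names `p_range` and `range_data` are only illustrative, and `layer_data()` is used to look at the values the stat computed.

```{r newargs}
library(ggplot2)

# Same summary as the original plot, using the argument names
# introduced in ggplot2 3.3.0 (assumed to be installed).
p_range <- ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )

# layer_data() returns the values computed by the stat for each cut.
range_data <- layer_data(p_range)
range_data[, c("x", "ymin", "y", "ymax")]
```

Each row should hold the minimum, median, and maximum depth for one level of `cut`.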


### 2. What does `geom_col()` do? How is it different to `geom_bar()`?

The `geom_col()` function has a different default stat than `geom_bar()`. The default stat of `geom_col()` is `stat_identity()`, which leaves the data as is. The `geom_col()` function expects the data to contain `x` values and `y` values, where the `y` values represent the bar heights.

The default stat of `geom_bar()` is `stat_count()`. The `geom_bar()` function only expects an `x` variable. The stat, `stat_count()`, preprocesses input data by counting the number of observations for each value of `x`. The `y` aesthetic uses the values of these counts.
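
As a rough illustration of the difference, the sketch below counts the diamonds per cut by hand and passes the result to `geom_col()`, then lets `geom_bar()` do the counting itself. The names `cut_counts`, `p_col`, and `p_bar` are made up for this example.

```{r colvsbar}
library(ggplot2)

# Count the diamonds in each cut by hand (base R, no stat involved).
cut_counts <- as.data.frame(table(cut = diamonds$cut), responseName = "n")

# geom_col() draws the precomputed heights as they are.
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))

# geom_bar() computes the heights itself by counting rows per cut.
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Both layers should end up with the same bar heights.
all.equal(layer_data(p_col)$y, layer_data(p_bar)$y)
```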

### 3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

The following table lists the pairs of geoms and stats that are almost always used in concert.

**Complementary geoms and stats**

| geom                | stat                |
|---------------------|---------------------|
| `geom_bar()`        | `stat_count()`      |
| `geom_bin2d()`      | `stat_bin_2d()`     |
| `geom_boxplot()`    | `stat_boxplot()`    |
| `geom_contour()`    | `stat_contour()`    |
| `geom_count()`      | `stat_sum()`        |
| `geom_density()`    | `stat_density()`    |
| `geom_density_2d()` | `stat_density_2d()` |
| `geom_hex()`        | `stat_bin_hex()`    |
| `geom_freqpoly()`   | `stat_bin()`        |
| `geom_histogram()`  | `stat_bin()`        |
| `geom_qq_line()`    | `stat_qq_line()`    |
| `geom_qq()`         | `stat_qq()`         |
| `geom_quantile()`   | `stat_quantile()`   |
| `geom_smooth()`     | `stat_smooth()`     |
| `geom_violin()`     | `stat_ydensity()`   |
| `geom_sf()`         | `stat_sf()`         |

They tend to have their names in common, such as `stat_smooth()` and `geom_smooth()`. However, this is not always the case, with `geom_bar()` and `stat_count()`, and `geom_histogram()` and `stat_bin()`, as notable counter-examples. Also, the pairs of geoms and stats that are used in concert almost always have each other as the default stat (for a geom) or geom (for a stat).
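
One way to check a default without reading the documentation is to inspect the layer object that a geom or stat function returns. This is only a sketch: it relies on the `stat` and `geom` fields of the layer objects, which is an implementation detail of ggplot2 rather than a documented interface, so treat it as an assumption that may not hold in every version.

```{r defaults}
library(ggplot2)

# Each layer stores its stat and geom as ggproto objects;
# the first class name identifies which one is in use.
class(geom_smooth()$stat)[1]     # default stat of geom_smooth()
class(stat_smooth()$geom)[1]     # default geom of stat_smooth()
class(geom_bar()$stat)[1]        # a pair whose names differ
class(geom_histogram()$stat)[1]  # another pair whose names differ
```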

The following tables contain the geoms and stats in [ggplot2](https://ggplot2.tidyverse.org/reference/).

**ggplot2 geom layers and their default stats.**

| geom                | default stat        | shared docs |
|---------------------|---------------------|-------------|
| `geom_abline()`     |                     |             |
| `geom_hline()`      |                     |             |
| `geom_vline()`      |                     |             |
| `geom_bar()`        | `stat_count()`      | x           |
| `geom_col()`        |                     |             |
| `geom_bin2d()`      | `stat_bin_2d()`     | x           |
| `geom_blank()`      |                     |             |
| `geom_boxplot()`    | `stat_boxplot()`    | x           |
| `geom_contour()`    | `stat_contour()`    | x           |
| `geom_count()`      | `stat_sum()`        | x           |
| `geom_density()`    | `stat_density()`    | x           |
| `geom_density_2d()` | `stat_density_2d()` | x           |
| `geom_dotplot()`    |                     |             |
| `geom_errorbarh()`  |                     |             |
| `geom_hex()`        | `stat_bin_hex()`    | x           |
| `geom_freqpoly()`   | `stat_bin()`        | x           |
| `geom_histogram()`  | `stat_bin()`        | x           |
| `geom_crossbar()`   |                     |             |
| `geom_errorbar()`   |                     |             |
| `geom_linerange()`  |                     |             |
| `geom_pointrange()` |                     |             |
| `geom_map()`        |                     |             |
| `geom_point()`      |                     |             |
| `geom_path()`       |                     |             |
| `geom_line()`       |                     |             |
| `geom_step()`       |                     |             |
| `geom_polygon()`    |                     |             |
| `geom_qq_line()`    | `stat_qq_line()`    | x           |
| `geom_qq()`         | `stat_qq()`         | x           |
| `geom_quantile()`   | `stat_quantile()`   | x           |
| `geom_ribbon()`     |                     |             |
| `geom_area()`       |                     |             |
| `geom_rug()`        |                     |             |
| `geom_smooth()`     | `stat_smooth()`     | x           |
| `geom_spoke()`      |                     |             |
| `geom_label()`      |                     |             |
| `geom_text()`       |                     |             |
| `geom_raster()`     |                     |             |
| `geom_rect()`       |                     |             |
| `geom_tile()`       |                     |             |
| `geom_violin()`     | `stat_ydensity()`   | x           |
| `geom_sf()`         | `stat_sf()`         | x           |

**ggplot2 stat layers and their default geoms.**

| stat                 | default geom        | shared docs |
|----------------------|---------------------|-------------|
| `stat_ecdf()`        | `geom_step()`       |             |
| `stat_ellipse()`     | `geom_path()`       |             |
| `stat_function()`    | `geom_path()`       |             |
| `stat_identity()`    | `geom_point()`      |             |
| `stat_summary_2d()`  | `geom_tile()`       |             |
| `stat_summary_hex()` | `geom_hex()`        |             |
| `stat_summary_bin()` | `geom_pointrange()` |             |
| `stat_summary()`     | `geom_pointrange()` |             |
| `stat_unique()`      | `geom_point()`      |             |
| `stat_count()`       | `geom_bar()`        | x           |
| `stat_bin_2d()`      | `geom_tile()`       | x           |
| `stat_boxplot()`     | `geom_boxplot()`    | x           |
| `stat_contour()`     | `geom_contour()`    | x           |
| `stat_sum()`         | `geom_point()`      | x           |
| `stat_density()`     | `geom_area()`       | x           |
| `stat_density_2d()`  | `geom_density_2d()` | x           |
| `stat_bin_hex()`     | `geom_hex()`        | x           |
| `stat_bin()`         | `geom_bar()`        | x           |
| `stat_qq_line()`     | `geom_path()`       | x           |
| `stat_qq()`          | `geom_point()`      | x           |
| `stat_quantile()`    | `geom_quantile()`   | x           |
| `stat_smooth()`      | `geom_smooth()`     | x           |
| `stat_ydensity()`    | `geom_violin()`     | x           |
| `stat_sf()`          | `geom_rect()`       | x           |


### 4. What variables does `stat_smooth()` compute? What parameters control its behavior?

The function `stat_smooth()` calculates the following variables:

 - `y`: predicted value
 - `ymin`: lower value of the confidence interval
 - `ymax`: upper value of the confidence interval
 - `se`: standard error

The “Computed Variables” section of the `stat_smooth()` documentation contains these variables.

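The parameters that control the behavior of `stat_smooth()` include:

 - `method`: the smoothing method used to calculate the predictions and confidence interval, for example `"lm"`, `"glm"`, `"gam"`, or `"loess"`
 - `formula`: the formula passed to the smoothing method, for example `y ~ x`
 - `method.args`: other arguments that are passed on to the smoothing method
 - `se`: whether the confidence interval is computed and displayed
 - `na.rm`: whether missing values are removed silently (`TRUE`) or with a warning (`FALSE`)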
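
The computed variables can be inspected directly with `layer_data()`. The sketch below is one possible illustration, not part of the exercise: it assumes the `mpg` data set that ships with ggplot2 and fixes `method` and `formula` explicitly so that no defaults have to be guessed. The names `p_smooth` and `smooth_data` are illustrative.

```{r smoothvars}
library(ggplot2)

# A linear smooth with a 95% confidence interval.
p_smooth <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.95)

# Each row is one point along the fitted line, with the variables
# that stat_smooth() computed for it.
smooth_data <- layer_data(p_smooth)
head(smooth_data[, c("x", "y", "ymin", "ymax", "se")])
```

Setting `se = FALSE` should drop the interval columns, since the interval is then not computed.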

### 5. In our proportion bar chart, we need to set `group = 1`. Why? In other words, what is the problem with these two graphs?

If `group = 1` is not included, then all the bars in the plot will have the same height, a height of 1. By default, `geom_bar()` treats each value of `x` as its own group, and the stat computes the proportions within each group, so every bar accounts for 100% of its own group.

```{r diamonds}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop..))
```

The problem with these two plots is that the proportions are calculated within each group rather than across the whole data set.

```{r twoplots}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop..))

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
```

The following code will produce the intended bar chart for the case with no `fill` aesthetic.

```{r nofill}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
```
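
The effect of `group = 1` can also be checked numerically. The following sketch compares the `prop` values computed with and without it. It uses `after_stat()`, which assumes ggplot2 3.3.0 or later, in place of the `..prop..` notation; the object names are illustrative.

```{r propcheck}
library(ggplot2)

# Without group = 1, every cut is its own group, so each proportion is 1.
p_by_cut <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# With group = 1, all cuts share one group, so the proportions sum to 1.
p_overall <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

layer_data(p_by_cut)$prop
layer_data(p_overall)$prop
```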

With the `fill` aesthetic, the heights of the bars need to be normalized by the total count.

```{r normal}
ggplot(data = diamonds) +
  geom_bar(aes(x = cut, y = ..count.. / sum(..count..), fill = color))
```
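
A possible way to confirm the normalization is to sum the heights of all the stacked segments, which should come to 1. This is a sketch that assumes ggplot2 3.3.0 or later, where `after_stat()` is available as a replacement for the `..count..` notation; `p_fill` and `fill_data` are illustrative names.

```{r afterstat}
library(ggplot2)

# Same plot, written with after_stat() instead of ..count..
p_fill <- ggplot(data = diamonds) +
  geom_bar(aes(x = cut, y = after_stat(count / sum(count)), fill = color))

# After stacking, each segment spans ymin to ymax; the segment
# heights across all cuts and colors should add up to 1.
fill_data <- layer_data(p_fill)
sum(fill_data$ymax - fill_data$ymin)
```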


