7 Exploratory Data Analysis


7.3.4 Exercises

1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

diamonds %>% sample_frac(0.1) %>%
  plot_ly(x = ~x, y = ~y, z = ~z, color = ~price) %>%
  add_markers()

Judging by its spread and its relationship with price, the x dimension seems to be the important one.
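The 3D scatter plot shows the three dimensions together. To look at each distribution on its own, here is a minimal sketch, assuming the tidyverse packages are installed. The binwidth of 0.5, the 0 to 12 viewing window, and the names `xyz_long` and `p_xyz` are arbitrary choices for illustration, not part of the original notes.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Reshape x, y, z into one column so each dimension gets its own facet.
xyz_long <- diamonds %>%
  select(x, y, z) %>%
  pivot_longer(everything(), names_to = "dimension", values_to = "mm")

# Zoom with coord_cartesian() so the few extreme values do not stretch
# the axis; no rows are dropped before binning.
p_xyz <- ggplot(xyz_long, aes(x = mm)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~dimension, ncol = 1) +
  coord_cartesian(xlim = c(0, 12))
p_xyz
```

Stacking the facets in one column makes it easier to compare where each dimension's values concentrate.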

2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

steps <- seq(from = 1, to = 1000) %>%
  map(~list(args = list("xbins.size", .x), label = .x, method = "restyle", value = .x))

diamonds %>%
  plot_ly() %>%
  add_histogram(x = ~price, xbins = list(size = 100)) %>%
  layout(sliders = list(
    list(
      active = 99,  # zero-based step index; step 99 is binwidth 100, matching xbins above
      currentvalue = list(prefix = "binwidth: "),
      steps = steps
    )
  ))

3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

diamonds %>%
  ggplot() +
  geom_freqpoly(aes(x = carat), binwidth = 0.01)

Weights just under a round number, such as 0.99 or 0.49, are avoided. Weights just over it (1.01) are fine.
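The frequency polygon shows the shape, but the exercise also asks for the actual counts. A short sketch, assuming dplyr is loaded (`carat_counts` is a name introduced here for illustration):

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Count the diamonds recorded at exactly 0.99 and exactly 1 carat.
carat_counts <- diamonds %>%
  filter(carat %in% c(0.99, 1)) %>%
  count(carat)
carat_counts
```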

4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

diamonds %>%
  ggplot() +
  geom_freqpoly(aes(x = carat)) +
  xlim(c(0, 1))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

diamonds %>%
  ggplot() +
  geom_freqpoly(aes(x = carat)) +
  coord_cartesian(xlim = c(0, 1))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

xlim() drops the values outside the limits, so the automatically chosen binwidth changes.
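One way to see the difference between the two zooming methods is to fix the binwidth and compare how many observations each plot actually bins. This is only a sketch: the binwidth of 0.1 and the object names are assumptions made for illustration.

```r
library(ggplot2)

# Same binwidth in both plots so only the zooming method differs.
p_xlim <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  xlim(c(0, 1))                     # out-of-range values become NA before binning

p_coord <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  coord_cartesian(xlim = c(0, 1))   # bins all the data, then zooms the view

# layer_data() returns the bins computed for each plot.
n_xlim  <- sum(suppressWarnings(layer_data(p_xlim))$count)
n_coord <- sum(layer_data(p_coord)$count)
c(xlim = n_xlim, coord_cartesian = n_coord)
```

With coord_cartesian() every row contributes to a bin, while xlim() removes the rows outside the limits before the statistic is computed.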

7.4.1 Exercises

1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

diamonds2 <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

diamonds2 %>%
  ggplot() +
  geom_histogram(aes(x = y))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 9 rows containing non-finite values (stat_bin).

diamonds2 %>%
  ggplot() +
  geom_bar(aes(x = y))

## Warning: Removed 9 rows containing non-finite values (stat_count).

I don't know. In the two plots above, both geoms drop the 9 missing values with a warning.

2. What does na.rm = TRUE do in mean() and sum()?

It removes the NA values before computing the result.
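A small base R illustration, using a made-up vector `x`:

```r
x <- c(1, NA, 3)

mean(x)                # NA: a missing value makes the result unknown
mean(x, na.rm = TRUE)  # 2: the NA is dropped first, leaving mean(c(1, 3))
sum(x, na.rm = TRUE)   # 4
```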

7.5.1.1 Exercises

1. Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

flights %>% 
  ggplot() +
  geom_boxplot(aes(x = is.na(air_time), y = sched_dep_time))

2. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

diamonds %>%
  ggplot() +
  geom_point(aes(x = carat, y = price))

diamonds %>%
  ggplot() +
  geom_boxplot(aes(x = cut, y = price))

diamonds %>%
  ggplot() +
  geom_boxplot(aes(x = color, y = price))

Probably carat?
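To check that guess numerically rather than by eye, one option is to compute the correlation between carat and price and then summarise carat by cut. This is a sketch, not a full analysis; `cor_carat_price` and `by_cut` are names introduced here.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Linear correlation between carat and price.
cor_carat_price <- cor(diamonds$carat, diamonds$price)

# Median carat and median price for each cut level.
by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(median_carat = median(carat), median_price = median(price))

cor_carat_price
by_cut
```

If the lower-quality cuts turn out to have a larger median carat, that would explain why they can look more expensive in the boxplot of price by cut.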

3. Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?

library(ggstance)

## 
## Attaching package: 'ggstance'

## The following objects are masked from 'package:ggplot2':
## 
##     geom_errorbarh, GeomErrorbarh

diamonds %>%
  ggplot() +
  geom_boxplot(aes(color, price)) +
  coord_flip()

diamonds %>%
  ggplot() +
  geom_boxploth(aes(price, color))

Is this actually more convenient?

4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?

library(lvplot)

diamonds %>%
  ggplot() +
  geom_lv(aes(cut, price, fill = ..LV..))

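It looks like a caterpillar, which is a bit creepy. The idea is that a single "outlier" category is too coarse and sometimes inappropriate, so the plot breaks the tails down in finer detail. The paper "Letter-Value Plots: Boxplots for Large Data" is apparently worth reading.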

5. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?

diamonds %>%
  ggplot() +
  geom_histogram(aes(x = price)) +
  facet_wrap(~cut)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

diamonds %>%
  ggplot() +
  geom_freqpoly(aes(x = price,  color = cut))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

diamonds %>%
  ggplot() +
  geom_violin(aes(x = cut, y = price))

Easy to compare side by side.

6. If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

There are quite a few.
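As a starting point, here is a sketch of the two main geoms in ggbeeswarm, assuming the package is installed. The choice of the mpg dataset and of the drv and hwy variables is only for illustration.

```r
library(ggplot2)
library(ggbeeswarm)

# geom_quasirandom(): offsets points with a quasirandom sequence so the
# width of each group follows the local density, like a violin made of points.
p_quasi <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_quasirandom()

# geom_beeswarm(): offsets points just enough that they do not overlap.
p_bee <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_beeswarm()

p_quasi
p_bee
```

Unlike geom_jitter(), neither geom places points at purely random offsets, so the shape of each group reflects its distribution.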

7.5.2.1 Exercises

1. How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?

count1 <- diamonds %>%
  count(color, cut)

## cut within color
count1 %>%
  group_by(color) %>%
  mutate(cut_p = n / sum(n)) %>%
  ggplot(aes(color, cut)) +
  geom_tile(aes(fill = cut_p)) +
  ggtitle("cut within color")

## color within cut
count1 %>%
  group_by(cut) %>%
  mutate(color_p = n / sum(n)) %>%
  ggplot(aes(color, cut)) +
  geom_tile(aes(fill = color_p)) +
  ggtitle("color within cut")

2. Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

delay_count <- flights %>%
  group_by(month, dest) %>%
  summarise(d = mean(dep_delay, na.rm = TRUE)) %>%
  ungroup()

delay_count %>%
  ggplot(aes(month, dest)) +
  geom_tile(aes(fill = d))

The data has many gaps, which makes the plot hard to read, so fill the combinations that do not exist with 0.

delay_count %>%
  complete(month, dest, fill = list(d = 0)) %>%
  ggplot(aes(month, dest)) +
  geom_tile(aes(fill = d))

???

3. Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?

Perhaps because it is better to orient the plot the same way as the pattern of n in the legend?

7.5.3.1 Exercises

1. Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
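A sketch of the difference between the two binning helpers, using carat as the binned variable. The bin settings (width 1, 5 groups, binwidth 500) and the object names are arbitrary choices for illustration.

```r
library(ggplot2)

# cut_width(): bins of equal width, so the counts per bin can differ wildly.
width_counts <- table(cut_width(diamonds$carat, width = 1, boundary = 0))

# cut_number(): bins holding roughly equal numbers of observations,
# so the bin widths differ instead.
number_counts <- table(cut_number(diamonds$carat, n = 5))

width_counts
number_counts

# Frequency polygons of price, one line per carat bin.
p_freq <- ggplot(diamonds, aes(x = price, colour = cut_number(carat, 5))) +
  geom_freqpoly(binwidth = 500)
p_freq
```

With cut_width(), sparsely populated bins produce noisy lines, so the number of observations per bin matters. With cut_number(), each line rests on a similar amount of data, but the legend labels cover uneven ranges.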

2. Visualise the distribution of carat, partitioned by price.

diamonds %>%
  ggplot() +
  geom_freqpoly(aes(x = carat, y = ..density.., color = cut_width(price, 10000)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

The reverse of the plot above.

diamonds %>%
  ggplot() +
  geom_freqpoly(aes(x = price, y = ..density..,  color = cut_number(carat,  2)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

4. Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.

diamonds %>%
  # wt = price makes n the total price in each cell, not a row count
  count(cut, carat = cut_number(carat, 20), wt = price) %>%
  ggplot() +
  geom_tile(aes(cut, carat, fill = n))

5. Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

geom_bin2d() is appropriate when you want to discard fine details such as outliers as noise.