Wk 4 Assignment: Data visualisation from R for Data Science

title: "Untitled"
author: "Suma Pendyala"
date: "6/21/2020"
output: html_document

Chapter 7 - Exploratory Data Analysis

1 - Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.1     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
ggplot(data = diamonds, mapping = aes(x = x)) +
  geom_density() +
  geom_rug() +
  labs(title = 'Distribution of x(length)')
plot of chunk unnamed-chunk-2
ggplot(data = diamonds, mapping = aes(x = y)) +
  geom_density() +
  geom_rug() +
  labs(title = 'Distribution of y(width)')
plot of chunk unnamed-chunk-3
ggplot(data = diamonds, mapping = aes(x = z)) +
  geom_density() +
  geom_rug() +
  labs(title = 'Distribution of z(depth)')
plot of chunk unnamed-chunk-4
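
The density plots compress most of the data into a narrow band because a few extreme values stretch the axis. The following is a minimal sketch, assuming the goal is to inspect the range of each dimension and set aside physically implausible values before replotting. The cutoffs (0 mm and 20 mm) and the names `dim_summary`, `plausible`, and `p_y` are illustrative choices, not part of the exercise.

```r
library(ggplot2)
library(dplyr)

# Range of each dimension: a minimum of 0 mm or a very large maximum
# points to likely data entry errors
dim_summary <- diamonds %>%
  summarise(across(c(x, y, z), list(min = min, max = max)))
print(dim_summary)

# Keep rows whose dimensions fall inside an assumed plausible range
# (the cutoffs are illustrative)
plausible <- diamonds %>%
  filter(x > 0, y > 0, z > 0, y < 20, z < 20)

# Histogram of y on the filtered data; call print(p_y) to draw it
p_y <- ggplot(plausible, aes(x = y)) +
  geom_histogram(binwidth = 0.1) +
  labs(title = "Distribution of y after dropping implausible values")
```

Comparing `nrow(plausible)` with `nrow(diamonds)` shows how many rows the filter sets aside.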

2 - Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 20)
plot of chunk unnamed-chunk-5
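
The hint asks for a wide range of binwidths, while the chunk above tries only one. A minimal sketch of one way to compare several at once follows; the three binwidths, the zoom window, and the names `price_plots` and `zoomed` are assumptions chosen for illustration.

```r
library(ggplot2)

# Candidate binwidths in dollars (illustrative values)
binwidths <- c(10, 100, 1000)

# Build one histogram per binwidth and keep them in a named list
price_plots <- lapply(binwidths, function(bw) {
  ggplot(diamonds, aes(x = price)) +
    geom_histogram(binwidth = bw) +
    labs(title = paste("Price, binwidth =", bw))
})
names(price_plots) <- paste0("bw_", binwidths)

# Zoom in on the low end of the price range without dropping any data
zoomed <- price_plots$bw_10 + coord_cartesian(xlim = c(1000, 2000))
```

Printing each element of `price_plots` in turn, and then `zoomed`, makes it easier to spot gaps or spikes that a single binwidth can hide.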

3 - How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

diamonds %>% filter(between(carat, .96, 1.05)) %>%
  group_by(carat) %>% summarize(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 10 x 2
##    carat count
##    <dbl> <int>
##  1  0.96   103
##  2  0.97    59
##  3  0.98    31
##  4  0.99    23
##  5  1     1558
##  6  1.01  2242
##  7  1.02   883
##  8  1.03   523
##  9  1.04   475
## 10  1.05   361
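
To answer the question directly, the table can be narrowed to the two carat values of interest. This is a small sketch; the names `carat_counts` and `ratio` are illustrative.

```r
library(ggplot2)
library(dplyr)

# Count only the 0.99 and 1 carat diamonds
carat_counts <- diamonds %>%
  filter(between(carat, 0.99, 1)) %>%
  count(carat)
print(carat_counts)

# How many 1 carat diamonds there are for each 0.99 carat diamond
# (count() sorts by carat, so row 1 is 0.99 and row 2 is 1)
ratio <- carat_counts$n[2] / carat_counts$n[1]
```

The counts match the table above (23 versus 1558). One plausible reading, offered as an assumption rather than a finding, is that stones are cut or reported to reach the 1 carat threshold.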

4 - Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 20) +
  coord_cartesian(xlim = c(0,5000), ylim = c(0,700))
plot of chunk unnamed-chunk-7
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 20) +
  xlim(c(0,5000)) +
  ylim(c(0,700))
## Warning: Removed 14714 rows containing non-finite values (stat_bin).
## Warning: Removed 3 rows containing missing values (geom_bar).
plot of chunk unnamed-chunk-8
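
The warnings above hint at the difference: `xlim()` removes rows before the bins are computed, while `coord_cartesian()` only changes the visible window. A minimal sketch that makes this measurable by summing the computed bin counts with `layer_data()`; the variable names are illustrative.

```r
library(ggplot2)

base_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 20)

# coord_cartesian() zooms the view; every row still reaches the binning step
zoomed <- base_hist + coord_cartesian(xlim = c(0, 5000))

# xlim() sets scale limits; rows outside them are dropped before binning
clipped <- base_hist + xlim(0, 5000)

# layer_data() returns the computed bins, so the counts can be compared
n_zoomed  <- sum(layer_data(zoomed)$count)
n_clipped <- sum(suppressWarnings(layer_data(clipped))$count)
```

Here `n_zoomed` equals the number of rows in `diamonds`, while `n_clipped` is smaller because the out-of-range prices never reach the histogram.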

7.4.1 Exercises

data.frame(value = c(NA, NA, NA, rnorm(1000,0,1))) %>% ggplot() +
  geom_histogram(mapping = aes(x = value), bins = 50)
## Warning: Removed 3 rows containing non-finite values (stat_bin).
plot of chunk unnamed-chunk-9
ggplot(data = data.frame(type = c('A','A','B','B','B',NA))) +
  geom_bar(mapping = aes(x = type))
plot of chunk unnamed-chunk-10

2 - What does na.rm = TRUE do in mean() and sum()?

mean(c(1,2,3,NA,4), na.rm = TRUE)
## [1] 2.5
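
A short side-by-side comparison, using the same vector as above, shows what changes when `na.rm` is left at its default of `FALSE`.

```r
x <- c(1, 2, 3, NA, 4)

mean(x)                # NA: one missing value makes the result missing
mean(x, na.rm = TRUE)  # 2.5: NA is dropped, then the four remaining values are averaged
sum(x)                 # NA
sum(x, na.rm = TRUE)   # 10
```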

7.5.1.1 Exercises

nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot(mapping = aes(sched_dep_time)) +
    geom_density(mapping = aes(colour = cancelled))
plot of chunk unnamed-chunk-12
nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = cancelled, y = sched_dep_time))
plot of chunk unnamed-chunk-13

2 - What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

diamonds %>%
  mutate(cut = as.numeric(cut),
         color = as.numeric(color),
         clarity = as.numeric(clarity)) %>%
  select(price, everything()) %>%
  cor()
##               price       carat         cut       color     clarity       depth
## price    1.00000000  0.92159130 -0.05349066  0.17251093 -0.14680007 -0.01064740
## carat    0.92159130  1.00000000 -0.13496702  0.29143675 -0.35284057  0.02822431
## cut     -0.05349066 -0.13496702  1.00000000 -0.02051852  0.18917474 -0.21805501
## color    0.17251093  0.29143675 -0.02051852  1.00000000  0.02563128  0.04727923
## clarity -0.14680007 -0.35284057  0.18917474  0.02563128  1.00000000 -0.06738444
## depth   -0.01064740  0.02822431 -0.21805501  0.04727923 -0.06738444  1.00000000
## table    0.12713390  0.18161755 -0.43340461  0.02646520 -0.16032684 -0.29577852
## x        0.88443516  0.97509423 -0.12556524  0.27028669 -0.37199853 -0.02528925
## y        0.86542090  0.95172220 -0.12146187  0.26358440 -0.35841962 -0.02934067
## z        0.86124944  0.95338738 -0.14932254  0.26822688 -0.36695200  0.09492388
##              table           x           y           z
## price    0.1271339  0.88443516  0.86542090  0.86124944
## carat    0.1816175  0.97509423  0.95172220  0.95338738
## cut     -0.4334046 -0.12556524 -0.12146187 -0.14932254
## color    0.0264652  0.27028669  0.26358440  0.26822688
## clarity -0.1603268 -0.37199853 -0.35841962 -0.36695200
## depth   -0.2957785 -0.02528925 -0.02934067  0.09492388
## table    1.0000000  0.19534428  0.18376015  0.15092869
## x        0.1953443  1.00000000  0.97470148  0.97077180
## y        0.1837601  0.97470148  1.00000000  0.95200572
## z        0.1509287  0.97077180  0.95200572  1.00000000
# Numeric copy of diamonds so the ordered factors can enter a linear model
diamonds_con <- diamonds %>%
  mutate(cut = as.numeric(cut),
         color = as.numeric(color),
         clarity = as.numeric(clarity))

summary(lm(price ~ carat + cut + carat*cut, data = diamonds_con))
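
The correlation matrix shows carat as the variable most strongly correlated with price, and a negative correlation between carat and cut. A minimal sketch that looks at the second relationship directly, without converting factors to numbers; the names `carat_by_cut` and `p_carat_cut` are illustrative.

```r
library(ggplot2)
library(dplyr)

# Median carat and median price for each cut level
carat_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(median_carat = median(carat),
            median_price = median(price))
print(carat_by_cut)

# Distribution of carat within each cut; call print(p_carat_cut) to draw it
p_carat_cut <- ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_boxplot()
```

If the lower cut grades tend to be heavier stones, as the summary suggests, their weight can offset the poorer cut, which would explain why lower quality diamonds can appear more expensive.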

3 - Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?

nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = cancelled, y = sched_dep_time)) +
  coord_flip()
plot of chunk unnamed-chunk-16
# install.packages("ggstance")  # run once if the package is not installed
library(ggstance)
nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot() +
  geom_boxploth(mapping = aes(y = cancelled, x = sched_dep_time))
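
Since ggplot2 3.3.0, `geom_boxplot()` can also be drawn horizontally without an extra package by mapping the categorical variable to `y`. The sketch below uses the built-in `mpg` data rather than the flights data so that it runs with ggplot2 alone; that substitution and the plot names are assumptions made for illustration.

```r
library(ggplot2)

# Horizontal boxplot: categorical variable on y, continuous variable on x
p_horizontal <- ggplot(mpg, aes(x = hwy, y = class)) +
  geom_boxplot()

# Same picture via coord_flip(): the aesthetics stay in their vertical roles
p_flipped <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()
```

The practical difference is in how the aesthetics read: with `coord_flip()` the mapping still says `x = class`, while the horizontal form states the axes as they appear on the plot.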

4 - One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of "outlying values". One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = price))
plot of chunk unnamed-chunk-20
# install.packages("lvplot")  # run once if the package is not installed
library(lvplot)
ggplot(data = diamonds) +
  geom_lv(mapping = aes(x = cut, y = price))

5 - Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?

diamonds %>% ggplot() +
  geom_histogram(mapping = aes(x = price), binwidth = 50) +
  facet_grid(cut~.)
plot of chunk unnamed-chunk-23
diamonds %>% ggplot() +
  geom_violin(mapping = aes(x = cut, y = price))
plot of chunk unnamed-chunk-24
diamonds %>% ggplot() +
  geom_freqpoly(mapping = aes(x = price, color = cut), binwidth = 50)
plot of chunk unnamed-chunk-25
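
One drawback of the count-based frequency polygon is that the cuts have very different group sizes, so the smaller groups are hard to compare. A minimal sketch that plots density instead of count follows; `after_stat()` requires ggplot2 3.3.0 or later, and the binwidth of 500 is an illustrative choice.

```r
library(ggplot2)

# Frequency polygon standardised so the area under each line is 1
p_density <- ggplot(diamonds,
                    aes(x = price, y = after_stat(density), colour = cut)) +
  geom_freqpoly(binwidth = 500)
```

With densities on the y axis, the shape of each cut's price distribution can be compared regardless of how many diamonds fall in that cut.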

6 - If you have a small dataset, it's sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = drv, y = displ))
plot of chunk unnamed-chunk-26
# install.packages("ggbeeswarm")  # run once if the package is not installed
library(ggbeeswarm)
ggplot(data = mpg) +
  geom_beeswarm(mapping = aes(x = drv, y = displ), priority = 'ascending')

7.5.2.1 Exercises

1 - How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?

diamonds %>% count(color, cut) %>% group_by(color) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot() +
  geom_tile(mapping = aes(x = color, y = cut, fill = prop)) +
  labs(title = 'Distribution of cut within color')
plot of chunk unnamed-chunk-29
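
The question also asks about colour within cut, which only requires changing the grouping variable. A minimal sketch of both rescalings, with a check that the proportions sum to 1 within each group; the object names are illustrative.

```r
library(ggplot2)
library(dplyr)

# Proportion of each cut within a colour (columns of the tile plot sum to 1)
cut_within_color <- diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Proportion of each colour within a cut (group by cut instead)
color_within_cut <- diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Sanity check: proportions within each colour should total 1
check <- cut_within_color %>%
  group_by(color) %>%
  summarise(total = sum(prop))
```

Either table can be passed to the same `geom_tile()` call with `fill = prop`.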

2 - Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

nycflights13::flights %>% group_by(dest, month) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot() +
  geom_tile(mapping = aes(x = month, y = dest, fill = avg_dep_delay))
## `summarise()` regrouping output by 'dest' (override with `.groups` argument)
plot of chunk unnamed-chunk-30
nycflights13::flights %>% group_by(dest, month) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(dest) %>%
  mutate(n_month = n())%>%
  ggplot() +
  geom_tile(mapping = aes(x = factor(month),
                          y = reorder(dest, n_month),
                          fill = avg_dep_delay)) +
  scale_fill_gradient2(low = 'yellow', mid = 'orange', high = 'red',
                       midpoint = 35)
## `summarise()` regrouping output by 'dest' (override with `.groups` argument)
plot of chunk unnamed-chunk-31

3 - Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?

diamonds %>% count(color, cut) %>% group_by(color) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot() +
  geom_tile(mapping = aes(x = cut, y = color, fill = prop)) +
  labs(title = 'Distribution of cut within color')
plot of chunk unnamed-chunk-32

7.5.3.1 Exercises

1 - Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?

diamonds %>% ggplot() +
  geom_freqpoly(mapping = aes(x = price,
                              color = cut_width(carat, .2)), bins = 30)
plot of chunk unnamed-chunk-33
diamonds %>% ggplot() +
  geom_freqpoly(mapping = aes(x = price,
                              color = cut_width(carat, .4)), bins = 30)
plot of chunk unnamed-chunk-34
diamonds %>% ggplot() +
  geom_freqpoly(mapping = aes(x = price,
                              color = cut_number(carat, 10)), bins = 30)
plot of chunk unnamed-chunk-35
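
The main thing to consider is group size: `cut_width()` fixes the width of each bin and lets the counts vary, while `cut_number()` aims for roughly equal counts and lets the widths vary. A small sketch that tabulates both follows; the width of 0.5 and the names `by_width` and `by_number` are illustrative.

```r
library(ggplot2)

# Equal-width bins: a few bins hold most diamonds, others hold very few
by_width <- cut_width(diamonds$carat, 0.5)
print(table(by_width))

# Bins with roughly equal counts: widths differ, group sizes are comparable
by_number <- cut_number(diamonds$carat, 10)
print(table(by_number))
```

With count-based frequency polygons, the sparsely populated `cut_width()` groups are nearly invisible, which is worth keeping in mind when reading the plots above.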

2 - Visualise the distribution of carat, partitioned by price.

diamonds %>% ggplot() +
  geom_density(mapping = aes(x = carat,
                             color = cut_width(price, 5000, boundary = 0)))
plot of chunk unnamed-chunk-36

3 - How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

diamonds %>% ggplot() +
  geom_boxplot(mapping = aes(x = cut_number(carat, 10),
                             y = price)) +
  coord_flip()
plot of chunk unnamed-chunk-37

4 - Combine two of the techniques you've learned to visualise the combined distribution of cut, carat, and price.

diamonds %>% ggplot() +
  geom_boxplot(mapping = aes(x = cut, y = price,
                             color = cut_number(carat, 5)))
plot of chunk unnamed-chunk-38
diamonds %>% mutate(carat_group = cut_number(carat, 10)) %>%
  group_by(cut, carat_group) %>%
  summarize(avg_price = mean(price)) %>%
  ggplot() +
  geom_tile(mapping = aes(x = cut, y = carat_group,
                          fill = avg_price))
## `summarise()` regrouping output by 'cut' (override with `.groups` argument)
plot of chunk unnamed-chunk-39
diamonds %>% ggplot() +
  geom_bin2d(mapping = aes(x = carat, y = price)) +
  facet_grid(cut~.)
plot of chunk unnamed-chunk-40
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
plot of chunk unnamed-chunk-41
ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = x, y = y), bins = 800) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
plot of chunk unnamed-chunk-42