Chapter 7

###Exercise 7.3.4 1.Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth. ###Answer

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

ggplot( data = diamonds ) +
  geom_freqpoly(binwidth=0.1,aes(x = x ), color = "red") +
  geom_freqpoly(binwidth=0.1,aes(x = y ), color = "blue") +
  geom_freqpoly(binwidth=0.1,aes(x = z ), color  ="green")

diamonds %>%
  mutate(id = row_number()) %>%
  select(x, y, z, id) %>%
  gather(variable, value, -id)  %>%
  ggplot(aes(x = value)) +
  geom_density() +
  geom_rug() +
  facet_grid(variable ~ .)

The z values are much smaller than the x and y values, so z is most likely the depth. The x and y distributions are nearly identical, which fits length and width.
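A numeric summary can back up the visual impression. The sketch below is a minimal example, assuming dplyr 1.0 or later for `across()`; the name `dim_summary` is introduced here only for illustration. It reports the mean, maximum, and number of zero entries for each dimension, which makes implausible values (a dimension of 0 mm, or an extreme maximum) easier to spot.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# dim_summary is an illustrative name, not part of the original answer
dim_summary <- diamonds %>%
  summarise(across(
    c(x, y, z),
    list(mean = mean, max = max, zeros = ~ sum(.x == 0))
  ))

# One row; columns are named x_mean, x_max, x_zeros, y_mean, ...
dim_summary
```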

2.Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.) ###Answer

ggplot( data = diamonds, aes( x = price))+ 
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

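The distribution is strongly right-skewed: most diamonds sit at the low end of the price range, with a long tail towards high prices. I did not expect so many diamonds to be priced this low.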
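The hint in the question suggests trying several bin widths. A minimal sketch, assuming a binwidth of 50 dollars and a zoom to prices below 2,500 (both values chosen only for illustration, as is the name `p_price`), could look like this:

```r
library(ggplot2)

# Narrow bins expose gaps and spikes that the default 30 bins smooth over
p_price <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 50) +
  coord_cartesian(xlim = c(0, 2500))

p_price
```

Repeating the plot with other binwidth values (for example 10, 100, or 1000) shows how much the visible detail depends on this choice.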

3.How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference? ###Answer

diamonds %>%
  ggplot() +
  geom_histogram( aes( x = carat ) ) +
  coord_cartesian(xlim=c(0,2))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

filter(diamonds,carat==0.99) %>% count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1    23

filter(diamonds,carat==1) %>% count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1  1558

There are 23 diamonds of 0.99 carat and 1,558 of 1 carat. I think this is because people ask for a round figure more often than they would ask for a fractional carat weight.
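To put the two counts in context, it may help to look at the neighbouring carat values as well. The sketch below is one way to do that; the 0.9 to 1.1 window and the name `carat_counts` are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Count diamonds at each distinct carat value close to 1
carat_counts <- diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  count(carat)

carat_counts
```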

4.Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows? ###Answer

diamonds %>% 
  ggplot() +
  geom_histogram( aes( x = carat ) ) +
  coord_cartesian(xlim=c(0,5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

diamonds %>% 
  ggplot() +
  geom_histogram( aes( x = carat ) ) +
  xlim( c(0,5) )

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

diamonds %>% 
  ggplot() +
  geom_histogram( aes( x = carat ) ) +
  coord_cartesian(ylim=c(0,1000))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

diamonds %>% 
  ggplot() +
  geom_histogram( aes( x = carat ) ) +
  ylim( c(0,1000))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 10 rows containing missing values (geom_bar).

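With xlim(), the one diamond whose carat lies outside the limits is removed before the bins are computed, and the 2 bars that extend past the limits are dropped. With ylim(), the 10 bars taller than 1000 are removed entirely. coord_cartesian() only zooms the view: all data are binned first and nothing is removed. With binwidth unset, each plot falls back to 30 bins.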
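The difference between zooming and setting scale limits can also be checked numerically. The following is a sketch under stated assumptions: the binwidth of 0.1, the limits 0 to 2, and the names `p_zoom`, `p_lim`, `n_zoom`, and `n_lim` are all illustrative choices. It uses `layer_data()` from ggplot2 to read back the bin counts each plot would draw.

```r
library(ggplot2)

# Same histogram, fixed binwidth, two ways of restricting the x axis to [0, 2]
p_zoom <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  coord_cartesian(xlim = c(0, 2))   # zooms only; all rows are still binned

p_lim <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  xlim(0, 2)                        # rows outside [0, 2] are dropped before binning

# Total number of diamonds counted across the bins of each plot
n_zoom <- sum(layer_data(p_zoom)$count)
n_lim  <- suppressWarnings(sum(layer_data(p_lim)$count))

c(coord_cartesian = n_zoom, xlim = n_lim)
```

The coord_cartesian() total should equal the number of rows in diamonds, while the xlim() total should be smaller because diamonds above 2 carat never reach the binning step.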

###Exercise 7.4.1 1.What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference? ###Answer

nycflights13::flights %>%
    count(is.na(dep_time)==TRUE)

## # A tibble: 2 × 2
##   `is.na(dep_time) == TRUE`      n
##   <lgl>                      <int>
## 1 FALSE                     328521
## 2 TRUE                        8255

nycflights13::flights %>%
    ggplot() + 
  geom_bar(aes(x = dep_time))

## Warning: Removed 8255 rows containing non-finite values (stat_count).

nycflights13::flights %>%
    ggplot() + 
  geom_histogram(aes(x = dep_time))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 8255 rows containing non-finite values (stat_bin).

nycflights13::flights %>%
    ggplot() + 
  geom_histogram(binwidth=1, aes(x = dep_time))

## Warning: Removed 8255 rows containing non-finite values (stat_bin).
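dep_time is numeric, so both geoms above place it on a continuous axis. For contrast, the sketch below uses a small made-up data frame (`toy` and its `grade` column are hypothetical, introduced only for illustration) with a categorical variable that has missing values. In that case geom_bar() is expected to draw NA as its own bar rather than drop it.

```r
library(ggplot2)

# Hypothetical data: a categorical variable with two missing values
toy <- data.frame(grade = c("a", "b", NA, "a", NA))

# A missing category can still be counted like any other level
grade_counts <- table(toy$grade, useNA = "ifany")
grade_counts

p_bar <- ggplot(toy, aes(x = grade)) +
  geom_bar()

p_bar
```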

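Both plots drop the 8,255 rows with a missing dep_time and print a warning, because dep_time is numeric and a missing number has no position on a continuous axis. With a categorical x variable, geom_bar() would instead show NA as its own bar, since a missing category can still be counted.

2.What does na.rm = TRUE do in mean() and sum()? ###Answer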

nycflights13::flights %>%
    group_by(dep_time) %>%
  count()

## # A tibble: 1,319 × 2
## # Groups:   dep_time [1,319]
##    dep_time     n
##       <int> <int>
##  1        1    25
##  2        2    35
##  3        3    26
##  4        4    26
##  5        5    21
##  6        6    22
##  7        7    22
##  8        8    23
##  9        9    28
## 10       10    22
## # … with 1,309 more rows

nycflights13::flights %>%
    group_by(dep_time) %>%
  count() %>%
  sum()

## [1] NA

nycflights13::flights %>%
    group_by(dep_time) %>%
  count() %>%
  sum(na.rm=TRUE)

## [1] 1995334

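na.rm = TRUE removes the NA values before sum() or mean() is computed; without it, a single NA makes the result NA. Note that sum() is applied here to the whole tibble, so 1995334 adds up both the dep_time and n columns and is not a count of flights.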
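The effect is easier to see on a short vector. This is a minimal sketch; `delays` is a hypothetical vector made up for the example.

```r
delays <- c(5, NA, 10, 15)   # hypothetical values

mean(delays)                 # NA, because one value is missing
mean(delays, na.rm = TRUE)   # 10, the mean of 5, 10 and 15
sum(delays, na.rm = TRUE)    # 30
```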

###Exercise 7.5.1 1.Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights. ###Answer

cancellation <- nycflights13::flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  )
  
ggplot( data = cancellation, mapping = aes(sched_dep_time)) +
  geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)

ggplot(data = cancellation , mapping = aes( x = cancelled , y = sched_dep_time) ) +
  geom_boxplot()

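Cancelled flights are far less frequent than non-cancelled flights (8,255 of 336,776), so the raw counts in the frequency polygon hide the shape of the cancelled distribution. The boxplot compares the two distributions directly.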
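Another option is to plot density instead of count, so that each group is scaled to the same area under its polygon. The sketch below is one possible version, assuming ggplot2 3.3.0 or later for `after_stat()`; the names `sched_dep_hour` and `p_density` are introduced here for illustration.

```r
library(dplyr)
library(ggplot2)

cancellation <- nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    # Convert HHMM to decimal hours, e.g. 1330 becomes 13.5
    sched_dep_hour = sched_dep_time %/% 100 + (sched_dep_time %% 100) / 60
  )

# Density puts both groups on a comparable scale despite unequal sizes
p_density <- ggplot(cancellation, aes(x = sched_dep_hour)) +
  geom_freqpoly(
    aes(colour = cancelled, y = after_stat(density)),
    binwidth = 1/4
  )

p_density
```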

2.What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive? ###Answer

3.Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()? ###Answer

ggplot( data = cancellation) +
  geom_boxplot( mapping = aes( x = cancelled , y = sched_dep_time ) ) + 
  coord_flip()

4.One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots? ###Answer

ggplot(diamonds,aes(x=cut,y=price))+
  geom_boxplot()+coord_flip()

require(lvplot)

## Loading required package: lvplot

ggplot(diamonds,aes(x=cut,y=price))+
  geom_lv(aes(fill=..LV..))+coord_flip()

5.Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method? ###Answer

ggplot(diamonds,aes(x=cut,y=price))+ 
  geom_violin()

ggplot(diamonds,aes(x=price))+
  geom_histogram()+facet_wrap(~cut)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(diamonds,aes(color=cut,x=price))+
  geom_freqpoly()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

6.If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

###Answer

require(ggbeeswarm)

## Loading required package: ggbeeswarm

###Exercise 7.5.2 1.How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut? ###Answer

2.Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it? ###Answer
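A possible starting point is sketched below. It is not the original author's answer: it assumes that "delay" means departure delay (dep_delay), requires dplyr 1.0 or later for the `.groups` argument, and the name `delay_by_dest_month` is illustrative.

```r
library(dplyr)
library(ggplot2)

# Average departure delay for every destination and month
delay_by_dest_month <- nycflights13::flights %>%
  group_by(month, dest) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop")

p_tile <- ggplot(
  delay_by_dest_month,
  aes(x = factor(month), y = dest, fill = avg_dep_delay)
) +
  geom_tile() +
  labs(x = "Month", y = "Destination", fill = "Mean departure delay")

p_tile
```

With this many destinations the y axis becomes crowded, so sorting or filtering the destinations would be a natural next step.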

3.Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above? ###Answer

###Exercise 7.5.3 1.Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price? ###Answer

2.Visualise the distribution of carat, partitioned by price. ###Answer

3.How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you? ###Answer

4.Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price. ###Answer

5.Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

###Answer

July 28, 2022
