### Exercise 7.3.4

1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

### Answer
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
ggplot( data = diamonds ) +
geom_freqpoly(binwidth=0.1,aes(x = x ), color = "red") +
geom_freqpoly(binwidth=0.1,aes(x = y ), color = "blue") +
geom_freqpoly(binwidth=0.1,aes(x = z ), color ="green")
diamonds %>%
mutate(id = row_number()) %>%
select(x, y, z, id) %>%
gather(variable, value, -id) %>%
ggplot(aes(x = value)) +
geom_density() +
geom_rug() +
facet_grid(variable ~ .)
The x and y distributions are nearly identical, while the z values are noticeably smaller, so x and y are most likely the length and width and z the depth.
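A numeric summary can back up the visual impression. The sketch below, which assumes dplyr 1.0 or later for `across()`, reports the minimum, median, and maximum of each dimension and pulls out rows where any dimension is recorded as zero. The names `dim_summary` and `zero_dims` are illustrative only.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Minimum, median, and maximum of each dimension (all measured in mm)
dim_summary <- diamonds %>%
  summarise(across(c(x, y, z), list(min = min, median = median, max = max)))
dim_summary

# Rows where at least one dimension is recorded as 0,
# which cannot be a real measurement and is worth inspecting
zero_dims <- diamonds %>%
  filter(x == 0 | y == 0 | z == 0)
nrow(zero_dims)
```

Comparing the medians is a quick way to check which dimension is systematically the smallest, and the zero rows are candidates for recoding as `NA`.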
2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

### Answer
ggplot( data = diamonds, aes( x = price))+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
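The distribution is right-skewed: most diamonds sit at the low end of the price range, close to 0, with a long tail toward high prices. It is surprising that so many diamonds are priced so low. With the default 30 bins this is all the plot can show, so finer detail needs a smaller binwidth.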
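The hint in the question suggests trying several binwidths. As a minimal sketch, the code below compares a coarse and a fine binwidth and zooms in on the cheaper diamonds. The binwidths (500 and 10) and the zoom window (0 to 2500) are arbitrary choices for illustration, not values taken from the exercise.

```r
library(ggplot2)

# Coarse bins: shows only the overall shape of the distribution
p_coarse <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

# Fine bins, zoomed with coord_cartesian() so no data is dropped before binning
p_fine <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 10) +
  coord_cartesian(xlim = c(0, 2500))

p_coarse
p_fine
```

Narrow bins can reveal gaps or spikes that a coarse histogram smooths over, which is why it helps to look at both.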
3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

### Answer
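# Plot the raw carat values; counting by carat first would make the
# histogram show the distinct carat values instead of the diamonds
diamonds %>%
ggplot() +
geom_histogram( aes( x = carat ) ) +
coord_cartesian(xlim=c(0,2))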
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
filter(diamonds,carat==0.99) %>% count()
## # A tibble: 1 × 1
## n
## <int>
## 1 23
filter(diamonds,carat==1) %>% count()
## # A tibble: 1 × 1
## n
## <int>
## 1 1558
There are 23 diamonds of 0.99 carat and 1,558 of 1 carat. I think this is because people ask for a round figure such as 1 carat far more often than they would ask for a value just below it.
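Both counts can also be obtained in one step. The following sketch filters to the two carat values and counts each. It additionally tabulates the neighbouring values from 0.9 to 1.1, a window chosen here purely for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Count the diamonds at exactly 0.99 and exactly 1 carat
carat_counts <- diamonds %>%
  filter(carat %in% c(0.99, 1)) %>%
  count(carat)
carat_counts

# Counts for every recorded carat value in a window around 1 carat
near_one <- diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  count(carat)
print(near_one, n = Inf)
```

Looking at the neighbouring values helps judge whether the jump at 1 carat is a general rounding pattern or specific to that value.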
4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

### Answer
diamonds %>%
ggplot() +
geom_histogram( aes( x = carat ) ) +
coord_cartesian(xlim=c(0,5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
diamonds %>%
ggplot() +
geom_histogram( aes( x = carat ) ) +
xlim( c(0,5) )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
diamonds %>%
ggplot() +
geom_histogram( aes( x = carat ) ) +
coord_cartesian(ylim=c(0,1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
diamonds %>%
ggplot() +
geom_histogram( aes( x = carat ) ) +
ylim( c(0,1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing missing values (geom_bar).
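coord_cartesian() only zooms the view: the bins are computed from all the data and then cropped. xlim() and ylim() drop anything outside the limits before drawing. With xlim(c(0, 5)), one row (a diamond above 5 carats) is removed before binning, and with ylim(c(0, 1000)) every bar taller than 1,000 is removed entirely, as the warnings show.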
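The difference is easier to see with an explicit binwidth and a limit that is intended to fall in the middle of a bar. This is a sketch under assumptions: the binwidth of 0.1 and the limit of 1 carat are illustrative choices, and `layer_data()` is used only to inspect how many observations each version actually bins.

```r
library(ggplot2)

# Zoom only: bins are computed from all diamonds, then the view is cropped,
# so a bar that straddles the limit is drawn partially
p_zoom <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  coord_cartesian(xlim = c(0, 1))

# Scale limits: observations outside the limits are discarded before binning,
# and bars that extend past the limits are removed (with warnings)
p_clip <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  xlim(0, 1)

# Number of observations that ended up in the bins of each plot
n_zoom <- sum(layer_data(p_zoom)$count)
n_clip <- suppressWarnings(sum(layer_data(p_clip)$count))
c(zoom = n_zoom, clip = n_clip)
```

If binwidth is left unset, the 30 default bins are spread over whatever range survives, so the two approaches can also produce bars of different widths.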
### Exercise 7.4.1

1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

### Answer
nycflights13::flights %>%
count(is.na(dep_time)==TRUE)
## # A tibble: 2 × 2
## `is.na(dep_time) == TRUE` n
## <lgl> <int>
## 1 FALSE 328521
## 2 TRUE 8255
nycflights13::flights %>%
ggplot() +
geom_bar(aes(x = dep_time))
## Warning: Removed 8255 rows containing non-finite values (stat_count).
nycflights13::flights %>%
ggplot() +
geom_histogram(aes(x = dep_time))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8255 rows containing non-finite values (stat_bin).
nycflights13::flights %>%
ggplot() +
geom_histogram(binwidth=1, aes(x = dep_time))
## Warning: Removed 8255 rows containing non-finite values (stat_bin).
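In all three plots the 8,255 missing dep_time values are removed with a warning, because dep_time is numeric and an NA has no position on a continuous axis. For a categorical variable, geom_bar() would instead draw NA as its own bar, since a missing value can be treated as just another category.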
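To see how a bar chart treats missing values in a categorical variable, the sketch below derives a character column from dep_time. The column name `dep_status` and its label "departed" are hypothetical, introduced only for this illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical categorical column: NA where the departure time is missing
flights_status <- nycflights13::flights %>%
  mutate(dep_status = if_else(is.na(dep_time), NA_character_, "departed"))

# On a discrete axis the NA values are kept and drawn as a separate bar
p_status <- ggplot(flights_status) +
  geom_bar(aes(x = dep_status))
p_status
```

Here no rows are dropped, in contrast to the histogram of the numeric dep_time.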
2. What does na.rm = TRUE do in mean() and sum()?

### Answer
nycflights13::flights %>%
group_by(dep_time) %>%
count()
## # A tibble: 1,319 × 2
## # Groups: dep_time [1,319]
## dep_time n
## <int> <int>
## 1 1 25
## 2 2 35
## 3 3 26
## 4 4 26
## 5 5 21
## 6 6 22
## 7 7 22
## 8 8 23
## 9 9 28
## 10 10 22
## # … with 1,309 more rows
nycflights13::flights %>%
group_by(dep_time) %>%
count() %>%
sum()
## [1] NA
nycflights13::flights %>%
group_by(dep_time) %>%
count() %>%
sum(na.rm=TRUE)
## [1] 1995334
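na.rm = TRUE removes the NA values before the calculation, so sum() and mean() return a number instead of NA. Note that sum() here adds up every cell of the tibble (both dep_time and n), so the total itself is not meaningful; the point is only that a single NA makes the result NA unless na.rm = TRUE is set.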
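The same behaviour is easier to follow on a small vector. The vector `v` below is made up for illustration.

```r
# A small example vector with one missing value
v <- c(1, 2, NA, 4)

sum(v)                 # NA: the missing value propagates
sum(v, na.rm = TRUE)   # 7: NA is dropped, then 1 + 2 + 4
mean(v)                # NA
mean(v, na.rm = TRUE)  # 7 / 3: the mean of the three remaining values

total_with_na <- sum(v)
total_without_na <- sum(v, na.rm = TRUE)
mean_without_na <- mean(v, na.rm = TRUE)
```

Note that mean() divides by the number of non-missing values, not by the original length of the vector.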
### Exercise 7.5.1

1. Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

### Answer
cancellation <- nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
)
ggplot( data = cancellation, mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
ggplot(data = cancellation , mapping = aes( x = cancelled , y = sched_dep_time) ) +
geom_boxplot()
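Cancelled flights are far less frequent than non-cancelled flights, so their line in the count-based frequency polygon is hard to compare with the other. The boxplot puts the scheduled departure times of both groups on the same footing.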
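Because the two groups differ so much in size, another option is to plot density instead of count, so that the area under each polygon is the same. The sketch below rebuilds the data so it can run on its own. The column name `sched_dep_hour` is an illustrative choice, and `after_stat()` assumes ggplot2 3.3.0 or later.

```r
library(dplyr)
library(ggplot2)

# Flag cancelled flights and convert the scheduled time to decimal hours
cancellation_density <- nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_dep_hour = sched_dep_time %/% 100 + (sched_dep_time %% 100) / 60
  )

# Density on the y axis standardises each group to the same total area
p_density <- ggplot(cancellation_density, aes(x = sched_dep_hour)) +
  geom_freqpoly(
    aes(colour = cancelled, y = after_stat(density)),
    binwidth = 1 / 4
  )
p_density
```

With both lines on a comparable scale, differences in the shape of the two distributions become visible rather than differences in volume.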
2. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

### Answer
3. Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?

### Answer
ggplot( data = cancellation) +
geom_boxplot( mapping = aes( x = cancelled , y = sched_dep_time ) ) +
coord_flip()
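For comparison with coord_flip(), a ggstance version might look like the sketch below. It assumes the ggstance package is installed and uses its `geom_boxploth()` geom. The data is rebuilt so the block runs on its own, and `sched_dep_hour` is an illustrative column name.

```r
library(dplyr)
library(ggplot2)
library(ggstance)  # assumed to be installed: install.packages("ggstance")

# Same derived data as before: cancellation flag and decimal scheduled hour
cancellation_h <- nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_dep_hour = sched_dep_time %/% 100 + (sched_dep_time %% 100) / 60
  )

# With geom_boxploth() the aesthetics are swapped directly:
# the continuous variable goes on x and the grouping variable on y
p_boxploth <- ggplot(cancellation_h) +
  geom_boxploth(aes(x = sched_dep_hour, y = cancelled))
p_boxploth
```

The main practical difference is where the variables are declared: coord_flip() keeps the vertical aesthetics and rotates the finished plot, while the horizontal geom maps the variables to the axes they actually appear on.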
4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?

### Answer
ggplot(diamonds,aes(x=cut,y=price))+
geom_boxplot()+coord_flip()
require(lvplot)
## Loading required package: lvplot
ggplot(diamonds,aes(x=cut,y=price))+
geom_lv(aes(fill=..LV..))+coord_flip()
5. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?

### Answer
ggplot(diamonds,aes(x=cut,y=price))+
geom_violin()
ggplot(diamonds,aes(x=price))+
geom_histogram()+facet_wrap(~cut)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds,aes(color=cut,x=price))+
geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
6. If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

### Answer
require(ggbeeswarm)
## Loading required package: ggbeeswarm
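As a minimal sketch, the two geoms ggbeeswarm provides as alternatives to geom_jitter() can be tried on a small dataset. The choice of the `mpg` dataset and of the `drv` and `hwy` variables is an assumption made for illustration, since a beeswarm layout is slow on something as large as diamonds.

```r
library(ggplot2)
library(ggbeeswarm)  # assumed to be installed: install.packages("ggbeeswarm")

# geom_beeswarm(): shifts points sideways just enough that they do not overlap
p_beeswarm <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_beeswarm()

# geom_quasirandom(): spreads points with a quasi-random offset, giving a
# shape that resembles a violin plot made of individual points
p_quasirandom <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_quasirandom()

p_beeswarm
p_quasirandom
```

Unlike geom_jitter(), whose offsets are purely random, both geoms place points according to the local density of the data, so the width of each cloud reflects how many observations share similar values.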
### Exercise 7.5.2

1. How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?

### Answer

2. Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

### Answer

3. Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?

### Answer

### Exercise 7.5.3

1. Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?

### Answer

2. Visualise the distribution of carat, partitioned by price.

### Answer

3. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

### Answer

4. Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.

### Answer

5. Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
### Answer