Chapter7hw.knit

title: “Chapter7 Homework”

Ferhiwot kidane

output: html_document

date: ‘2022-08-01’

###Exercise 7.3.4

1)Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

# remove false data points
diamonds <- diamonds %>% filter(2 < y & y < 20 & 2 < x & 2 < z & z < 20)
ggplot(diamonds) +
  geom_freqpoly(aes(x = x), binwidth = 0.01)

ggplot(diamonds) +
  geom_freqpoly(aes(x = y), binwidth = 0.01)

ggplot(diamonds) +
  geom_freqpoly(aes(x = z), binwidth = 0.01)

# x and y often share the same value
ggplot(diamonds) +
  geom_point(aes(x = x, y = y)) +
  geom_point(aes(x = x, y = z), color = "purple") +
  coord_fixed()

For most diamonds x and y are nearly equal, so I think they are the length and width; z is consistently smaller, so it is most likely the depth.
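A quick numeric check can back up this reading of the plots. The sketch below reuses the same filter as above and compares the typical size of each dimension; the summary column names (mean_x, cor_xy, and so on) are illustrative and not part of the original homework.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Same filter as above: drop physically implausible measurements
clean <- diamonds %>% filter(2 < y & y < 20 & 2 < x & 2 < z & z < 20)

dimension_summary <- clean %>%
  summarise(
    mean_x = mean(x),
    mean_y = mean(y),
    mean_z = mean(z),
    cor_xy = cor(x, y)  # close to 1 when x and y track each other
  )
dimension_summary
```

If x and y have similar means and a correlation near 1 while z has a clearly smaller mean, that is consistent with x and y being length and width and z being depth.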

2)Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

# remove false data points
diamonds <- diamonds %>% filter(2 < y & y < 20 & 2 < x & 2 < z & z < 20)
ggplot(diamonds) + 
  geom_freqpoly(aes(x = price), binwidth = 10) +
  xlim(c(1000, 2000))

## Warning: Removed 44207 rows containing non-finite values (stat_bin).

## Warning: Removed 2 row(s) containing missing values (geom_path).


There are no diamonds priced around $1,500; the frequency polygon drops to zero there.

3)How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

###Answer

diamonds %>% filter(carat == 0.99) %>% count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1    23

diamonds %>% filter(carat == 1) %>% count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1  1556

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  xlim(c(0.97, 1.03))

## Warning: Removed 48599 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).


There are far more 1 carat diamonds (1,556) than 0.99 carat diamonds (23). I think this is because, psychologically, 1 carat represents a whole new level compared with 0.99 carat, so for makers a little more material yields much more value.
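The same comparison can be made for every carat value near 1 in a single table. This is a minimal sketch that uses the unfiltered diamonds data, so the counts may differ slightly from the filtered counts shown above; the name near_one is illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Count diamonds at each recorded carat value between 0.97 and 1.03
near_one <- diamonds %>%
  filter(carat >= 0.97, carat <= 1.03) %>%
  count(carat)
near_one
```

A sharp jump in n at carat == 1 compared with the values just below it would support the rounding-up explanation.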

4)Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

###Answer

ggplot(diamonds) + 
  geom_histogram(aes(x = carat)) +
  xlim(c(0.97, 1.035))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 48599 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).


ggplot(diamonds) + 
  geom_histogram(aes(x = carat)) +
  coord_cartesian(xlim = c(0.97, 1.035))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  xlim(c(0.97, 1.035))

## Warning: Removed 48599 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing missing values (geom_bar).


ggplot(diamonds) + 
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  coord_cartesian(xlim = c(0.97, 1.035))

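coord_cartesian() plots first and then cuts: the bins are computed from all of the data and the view is simply zoomed, so a bar that crosses the limit is shown partially. xlim() cuts first and then plots: rows outside the limits are dropped before binning, and a bar that extends past the limits is removed, so xlim() does not show the half bar. When binwidth is left unset, the default 30 bins span the full range of carat under coord_cartesian() but only the restricted range under xlim(), so the two histograms look very different.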
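One way to confirm the difference is to inspect the binned data that ggplot2 computes for each plot. The sketch below uses layer_data() from ggplot2; the object names are illustrative, and suppressWarnings() only hides the "Removed ... rows" warning seen above.

```r
library(ggplot2)

base <- ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.01)

# xlim() discards out-of-range rows before the bins are computed
with_xlim <- suppressWarnings(layer_data(base + xlim(c(0.97, 1.035))))

# coord_cartesian() bins every row and only zooms the view afterwards
with_coord <- layer_data(base + coord_cartesian(xlim = c(0.97, 1.035)))

sum(with_xlim$count)   # only the diamonds inside the limits
sum(with_coord$count)  # every diamond in the dataset
```

The two totals differ because only xlim() changes the data that reaches stat_bin().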

###7.4 Missing values

###Exercises 7.4.1

1)What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

###Answer

set.seed(0)
df <- tibble(norm = rnorm(100)) %>% mutate(inrange = ifelse(norm > 2, NA, norm))
ggplot(df) +
  geom_histogram(aes(x = inrange))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing non-finite values (stat_bin).


geom_histogram() removes rows with NA values and warns about it.

# replace() keeps cut as a labelled factor; ifelse() would return its integer codes
df <- diamonds %>% mutate(cut = replace(cut, y > 7, NA))
ggplot(df) + geom_bar(aes(x = cut))

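geom_bar() does not remove NA values; it treats NA as another category. The difference arises because a histogram needs a numeric x value to place each observation in a bin, and NA has no position on a continuous axis, whereas a bar chart has a discrete x axis, on which NA can be shown as one more level.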
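The bar chart only shows meaningful labels if cut is still a factor after the NA values are injected. The sketch below compares two base R ways of setting values to NA; the names codes and kept are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# ifelse() drops the factor class and returns the underlying integer codes
codes <- ifelse(diamonds$y > 7, NA, diamonds$cut)

# replace() keeps the ordered factor and its labels
kept <- replace(diamonds$cut, diamonds$y > 7, NA)

class(codes)
levels(kept)
sum(is.na(kept))  # number of rows that would land in the NA bar
```

With the ifelse() version the bars are labelled 1 to 5 rather than Fair to Ideal, which makes the NA bar harder to interpret.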

###7.5 Covariation

###Exercise 7.5.1.1

1)Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

cancellation <- nycflights13::flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  )
  
# plot density rather than count so the two groups are comparable
ggplot(data = cancellation, mapping = aes(x = sched_dep_time, y = after_stat(density))) +
  geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)

## Warning: Removed 8255 rows containing non-finite values (stat_boxplot).

The frequency polygon and the boxplot show essentially the same thing: most cancelled flights were scheduled later in the day, at around 15:00 (3 pm). Note, however, that the frequency polygon has to plot density rather than count to control for there being far more non-cancelled flights than cancelled flights.
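The boxplot mentioned here is not shown above. One possible version, as a sketch that rebuilds the cancellation data frame and puts the scheduled departure time on the y axis (the axis labels are my own), is:

```r
library(dplyr)
library(ggplot2)

cancellation <- nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    # convert HHMM integers such as 1530 into decimal hours such as 15.5
    sched_dep_time = sched_dep_time %/% 100 + (sched_dep_time %% 100) / 60
  )

ggplot(cancellation, aes(x = cancelled, y = sched_dep_time)) +
  geom_boxplot() +
  labs(x = "Cancelled", y = "Scheduled departure (hour of day)")
```

Because a boxplot summarises each group separately, it needs no density adjustment for the unequal group sizes.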

2)What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

###Answer

ggplot(diamonds) +
  geom_point(aes(x = carat, y = price), color = "blue", alpha = 0.5)

ggplot(diamonds) +
  geom_point(aes(x = depth, y = price), color = "red", alpha = 0.5)

ggplot(diamonds) +
  geom_point(aes(x = table, y = price), color = "red", alpha = 0.5)

ggplot(diamonds) +
  geom_point(aes(x = x, y = price), color = "red", alpha = 0.5)

ggplot(diamonds) +
  geom_point(aes(x = z, y = price), color = "red", alpha = 0.5)

Volume and weight (carat) are the two most important variables for predicting price. Because volume is highly correlated with weight, they can be treated as a single variable.
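The scatterplots can be complemented with simple correlations. This sketch assumes that volume can be approximated by x * y * z, which treats each stone as a box and is only a rough proxy; the column names are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

price_cor <- diamonds %>%
  filter(2 < y & y < 20 & 2 < x & 2 < z & z < 20) %>%
  mutate(volume = x * y * z) %>%  # rough volume, assuming a box shape
  summarise(
    cor_carat  = cor(carat, price),
    cor_volume = cor(volume, price),
    cor_depth  = cor(depth, price),
    cor_table  = cor(table, price)
  )
price_cor
```

Correlation only measures linear association, so it is a summary of the plots rather than a replacement for them.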

ggplot(diamonds) +
  geom_boxplot(aes(x = cut, y = carat))

Better cuts tend to have lower carat, which lowers their price. So if we ignore carat, it appears that better-cut diamonds are cheaper and lower quality diamonds are more expensive.
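The medians behind this boxplot can be tabulated directly. A minimal sketch on the unfiltered data, with illustrative column names:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Median weight and price for each cut quality
by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    median_carat = median(carat),
    median_price = median(price)
  )
by_cut
```

Reading the two columns side by side shows whether the cuts with heavier stones are also the ones with higher typical prices.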

3)Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?

###Answer

library(ggstance)

## 
## Attaching package: 'ggstance'

## The following objects are masked from 'package:ggplot2':
## 
##     geom_errorbarh, GeomErrorbarh


ggplot(diamonds) + geom_boxplot(aes(x = cut, y = carat)) + coord_flip()

ggplot(diamonds) + geom_boxploth(aes(x = carat, y = cut))

The result is the same, but the geom_boxploth() call reads more naturally because each variable is mapped directly to the axis on which it appears.
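As a side note, ggplot2 3.3.0 and later (the session above loads 3.3.6) can infer the orientation of a boxplot from the aesthetics, so a third option needs neither ggstance nor coord_flip(). A minimal sketch:

```r
library(ggplot2)

# The continuous variable on x and the discrete one on y
# gives a horizontal boxplot directly
p <- ggplot(diamonds) +
  geom_boxplot(aes(x = carat, y = cut))
p
```

This keeps the same "map each variable to its own axis" style as geom_boxploth().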

4)One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?

###Answer

library(lvplot)
ggplot(diamonds) + geom_lv(aes(x = cut, y = price))

While the boxplot only shows a few quantiles and outliers, the letter-value plot shows many quantiles.

5)Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?

###Answer

ggplot(diamonds) +
  geom_histogram(aes(x = price)) +
  facet_wrap(~cut)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


ggplot(diamonds) +
  geom_freqpoly(aes(x = price)) +
  facet_wrap(~cut)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


ggplot(diamonds) +
  geom_violin(aes(x = cut, y = price))

ggplot(diamonds) +
  geom_lv(aes(x = cut, y = price))

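A violin plot is best for comparing the shape of the density across categories, but it hides how many observations are in each category. Faceted histograms show the counts, but comparing shapes across panels is harder. Frequency polygons are easy to compare when drawn on shared axes, but become hard to read when the lines overlap.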

6)If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

###Answer

require(ggbeeswarm)

## Loading required package: ggbeeswarm
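No answer is written out for this question. As a hedged sketch, the two geoms I know ggbeeswarm provides are geom_quasirandom() and geom_beeswarm(); the package documentation should be checked for the full list of methods. The mpg dataset and the drv and hwy variables are used only because they form a small example.

```r
library(ggplot2)
library(ggbeeswarm)

# geom_quasirandom(): spreads points with a quasirandom offset so that the
# width of each cloud follows the local density, much like a violin
p_quasi <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_quasirandom()

# geom_beeswarm(): shifts points sideways just enough to avoid overlap
p_swarm <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_beeswarm()

p_quasi
p_swarm
```

Both are alternatives to the purely random offsets of geom_jitter(), and both suit small datasets better than large ones.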

###Exercise 7.5.2.1

1)How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?

###Answer

diamonds %>%
  group_by(color) %>%
  count(color, cut) %>%
  mutate(
    prop = n/sum(n)
    ) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(aes(fill = prop))

diamonds %>%
  group_by(cut) %>%
  count(cut, color) %>%
  mutate(
    prop = n/sum(n)
  ) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(aes(fill = prop))

2)Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

###Answer

library("nycflights13")

flights %>%
  group_by(month, dest) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = factor(month), y = dest, fill = dep_delay)) +
  geom_tile() +
  labs(x = "Month", y = "Destination", fill = "Departure Delay")

## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.


A few things make the plot difficult to read:

1.) There are too many destinations for all of the labels to fit comfortably on either axis.

2.) The month variable is stored as an integer when it is really a discrete variable taking the values 1 to 12.

We could fix these issues by grouping destinations by state and by converting month to a discrete variable. The latter issue is easier to fix: we can simply use factor(month), as in the code above.
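Grouping by state would need an airport-to-state lookup, so the sketch below tries a simpler alternative that is my own assumption rather than the approach described above: keep only destinations that have flights in all 12 months, and order the rows by average delay. The name monthly_delay is illustrative.

```r
library(dplyr)
library(ggplot2)

monthly_delay <- nycflights13::flights %>%
  group_by(month, dest) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop") %>%
  group_by(dest) %>%
  filter(n() == 12) %>%  # keep destinations observed in every month
  ungroup()

ggplot(monthly_delay,
       aes(x = factor(month), y = reorder(dest, dep_delay), fill = dep_delay)) +
  geom_tile() +
  labs(x = "Month", y = "Destination", fill = "Departure Delay")
```

Dropping sparsely served destinations removes the empty tiles, and the ordering places destinations with similar delays next to each other.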

3)Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?

diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(y = color, x = cut)) +
  geom_tile(mapping = aes(fill = n))

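Cut has longer labels than color, and long labels are easier to read on the y axis, where they are written horizontally and do not overlap. Another reason for switching the order is that the larger numbers are at the top when x = color and y = cut, which lowers the cognitive burden of interpreting the plot.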

###7.5.3 Two continuous variables

###Exercise 7.5.3.1

1)Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?

###Answer

Both cut_width() and cut_number() split a variable into groups, but they bin the values differently. cut_width() splits the range into bins of a fixed width that you specify, so the number of observations per bin can differ greatly. cut_number() splits the data into a specified number of bins that each contain approximately the same number of observations, so the bin widths can differ.
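The difference is easy to see by tabulating the bins each function produces for carat. A minimal sketch with illustrative object names:

```r
library(ggplot2)  # provides diamonds, cut_width() and cut_number()

# Fixed width of 1 carat: equal-width bins, very unequal counts
by_width <- table(cut_width(diamonds$carat, 1, boundary = 0))

# Five bins with roughly equal counts: unequal widths
by_number <- table(cut_number(diamonds$carat, 5))

by_width
by_number
```

The bin labels in the two tables are also the legend entries that appear in the plots that use these functions.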

ggplot(
  data = diamonds,
  mapping = aes(color = cut_number(carat, 5), x = price)
) +
  geom_freqpoly() +
  labs(x = "Price", y = "Count", color = "Carat")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(
  data = diamonds,
  mapping = aes(color = cut_width(carat, 1, boundary = 0), x = price)
) +
  geom_freqpoly() +
  labs(x = "Price", y = "Count", color = "Carat")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

When looking at graphs produced with these functions, it is important to pay attention to the bins. With cut_width(), the intervals are evenly spaced but the curves cover very different numbers of diamonds, so the curves for sparsely populated bins are nearly flat. With cut_number(), each curve covers about the same number of diamonds, so the curves are comparable in height, but the intervals have unequal widths.

Also, when using color remember that it is best practice to have no more than 8 categories because the colors become increasingly less distinct afterward.

2)Visualise the distribution of carat, partitioned by price.

###Answer

ggplot(diamonds, aes(x = cut_number(price, 10), y = carat)) +
  geom_boxplot() +
  coord_flip() +
  xlab("Price")

The next plot uses a box plot with 10 equal-width bins of $2,000 each. The argument boundary = 0 ensures that the first bin is $0–$2,000.

ggplot(diamonds, aes(x = cut_width(price, 2000, boundary = 0), y = carat)) +
  geom_boxplot(varwidth = TRUE) +
  coord_flip() +
  xlab("Price")

3)How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

ggplot(data = diamonds, mapping = aes(x = cut_width(carat, 1, boundary = 0), y = price)) +
  geom_boxplot(varwidth = TRUE) +
  labs(x = "carat", y = "price")

ggplot(data = diamonds, mapping = aes(x = cut_interval(carat, n = 3), y = price)) +
  geom_boxplot(varwidth = TRUE) +
  labs(x = "carat", y = "price")

ggplot(data = diamonds, mapping = aes(x = cut_number(carat, 5), y = price)) +
  geom_boxplot() +
  labs(x = "carat", y = "price")

The first thing to note is that there are far more small diamonds (carat < 3) than large diamonds (carat of 3 or more). Only 40 of the 53,940 diamonds in our dataset have a carat of 3 or more, and about 68% have a carat of 1 or less.
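These figures can be recomputed directly from the data. The sketch below uses the unfiltered diamonds dataset and illustrative variable names, so the results may differ slightly from counts taken after the filtering done earlier.

```r
library(ggplot2)  # provides the diamonds dataset

n_total     <- nrow(diamonds)            # all diamonds
n_large     <- sum(diamonds$carat >= 3)  # diamonds of 3 carat or more
share_small <- mean(diamonds$carat <= 1) # proportion of 1 carat or less

n_total
n_large
round(share_small, 2)
```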

This is most evident in the boxplot that uses cut_number(). The first box covers only the range [0.2, 0.35], whereas the last box needs the range [1.13, 5.01] to contain the same number of diamonds.

There is a clear positive relationship between carat and price: as one increases, so does the other. Factors other than carat also matter, but what is perhaps most interesting is that the largest price increases occur when carat crosses a whole number (0 to 1, 1 to 2, 2 to 3), and the effect appears to taper off after 3. This effect might reverse with more data points, but it is odd that the price does not increase as we would expect.

diamonds %>%
  filter(carat >= 3) %>%
  select(carat, cut, color, price) %>%
  ggplot(mapping = aes(x = carat, y = price)) +
  geom_point()

diamonds %>%
  filter(carat >= 3) %>%
  select(carat, cut, color, price) %>%
  group_by(cut, color) %>%
  mutate(
    count = n()
  ) %>%
  ungroup() %>%
  mutate(
    prop = count/sum(count)
  ) %>%
  ggplot(mapping = aes(x = cut, y = color)) +
  geom_count()

diamonds %>%
  filter(carat >= 3) %>%
  select(carat, cut, color, price) %>%
  group_by(cut, color) %>%
  mutate(
    count = n()
  ) %>%
  ungroup() %>%
  mutate(
    prop = count/sum(count)
  ) %>%
  ggplot(mapping = aes(x = carat, y = cut_number(price, 5))) +
  geom_boxplot() +
  labs(x = "carat", y = "price")

4)Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.

diamonds %>%
  filter(between(carat, 0, 2.5)) %>%
  mutate(carat = cut_width(carat, 1)) %>%
  ggplot(aes(cut, price)) +
  geom_boxplot() +
  scale_y_log10() +
  facet_wrap(~ carat)

ggplot(data = diamonds, mapping = aes(x = cut_number(carat, 5), y = price)) +
  geom_boxplot(aes(color = cut))

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot(aes(color = cut_number(carat, 5)))

5)Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately. Why is a scatterplot a better display than a binned plot for this case?

###Answer

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4,11), ylim = c(4,11))

A scatterplot is a better display for this case because there is a strong relationship between x and y, and the outliers are unusual only in their combination of x and y values. Were we to bin the data, the individual points would be replaced by per-bin summaries, so points that depart from the bivariate relationship could no longer be picked out, and values could be mistaken for outliers within a bin even though they fit the relationship well.
