Assignment 3 - Carlos Rivera (car808)

Step 2: Workbook exercises

library(tidyverse)
library(ggExtra)
library(ggridges)
library(patchwork)
library(nycflights13)

7.3.4 Exercises

1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
diamonds %>% 
  select(x:z) %>% 
  gather() %>% 
  ggplot()+
  aes(value, fill=key)+
  geom_histogram(binwidth = 0.1, alpha=0.6)+
  # geom_rug(aes(color=key))+
  geom_boxplot(width = 200)+
  facet_wrap(~key, scale = "free")+
  theme(legend.position = "none")

  • it can be seen that there are atypical data in both y and z.

  • Although you might think that most diamonds are completely circular when viewed from the top, in reality there may be slight variations, so x would represent the largest diameter and y the smallest diameter of the oval, and z would be the height of the diamond measured from the base to the top.

2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
p1 = ggplot(diamonds)+
  aes(price)+
  geom_histogram()

p2 = ggplot(diamonds)+
  aes(price)+
  geom_histogram(binwidth = 10)

p1 + p2
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

  • by changing the binwidth to a low value binwidth = 10 it can be observed that there is a price range where there is no data, thus generating two distributions, on the left where most of the data accumulates with a pronounced peak, and after the interval a distribution with a long tail on the right, in which a slight peak is formed below $5000.
3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
c0.99 = sum(diamonds$carat == 0.99)
c1.00 = sum(diamonds$carat == 1.00)
  • There are 23 diamonds with a carat = 0.99

  • There are 1558 diamonds with a carat = 1.00

  • This is due to the tendency of the carat value to accumulate to values of 1 decimal place, which may be due to the precision or ease with which the carat values are determined, where being so close to 1, it is preferred to round up to the higher value, thus generating a bias in the values.

4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
p0 = ggplot(diamonds)+
  aes(price)+
  geom_histogram()+
  geom_vline(xintercept = c(2000,3000), color = "blue")

p1 = ggplot(diamonds)+
  aes(price)+
  geom_histogram()+
  coord_cartesian(xlim = c(2000,3000))+
  geom_vline(xintercept = c(2000,3000), color = "blue")

p2 = ggplot(diamonds)+
  aes(price)+
  geom_histogram()+
  xlim(2000,3000)+
  geom_vline(xintercept = c(2000,3000), color = "blue")

p0 / (p1 + p2)
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 47807 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

here we can observe that in the case of using xlim inside coord_cartesian() only the graph is zoomed-in, but it keeps all the parameters of the parent graph, as far as ylim and binwidth are concerned. In contrast, using directly the xlim function to define the limits recalculates the parameters to adapt them to the limits of the new chart, the resulting histogram has to start from a filtered dataset for the desired limits, hence the binwidth and ylim are different.

7.4.1 Exercises

1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
set.seed(123)
x1 = diamonds$cut
x1[sample(length(x1), length(x1)*0.1)] = NA

x2 = diamonds$carat
x2[sample(length(x2), length(x2)*0.1)] = NA

p1 = ggplot() + aes(x1) + geom_bar()
p2 = ggplot() + aes(x2) + geom_histogram()

p1 + p2
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 5394 rows containing non-finite outside the scale range
(`stat_bin()`).

  • In the case of geom_bar() when plotting a categorical variable, it does not return any warning message, if the resulting graph presents a bar with the new category NA.

  • In the case of geom_histogram(), the first thing is that it does return a warning message: Removed 5394 rows containing non-finite outside the scale range (stat_bin())., referring to the fact that to build the graph first filtered the data by removing the NA and then if the histogram is built with the non-NA values.

2. What does na.rm = TRUE do in mean() and sum()?
mean(x2); mean(x2, na.rm = T)
[1] NA
[1] 0.7979873
sum(x2); sum(x2, na.rm = T)
[1] NA
[1] 38739.09
  • As can be seen in the example, for the two functions mean() and sum() when used directly with the default arguments when the vector to operate on has NA values the return will be NA as well, and using the argument na.rm = TRUE implies that the missing values (NA) will be removed and the calculation will be performed with the remaining values that have finite values. In any case these functions return neither error nor warning in this situation.

  • This behavior can also be observed in other functions of the base package, such as min(), max(), median(), sd(), var().

7.5.1.1 Exercises

2. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
diamonds %>%
  filter(between(y, 2, 20)) %>%
  filter(between(z, 2, 10)) %>%
  pivot_longer(-c(cut, color, clarity, price)) %>%
  ggplot()+
  aes(value, price)+
  geom_point()+
  facet_wrap(~name, scale = "free")

We can observe that one of the variables that correlates most with price and therefore a great candidate for price prediction is carat.

diamonds %>%
  filter(carat < 2.5) %>%
  ggplot()+
  aes(carat, cut)+
  geom_density_ridges(alpha=0.5)
Picking joint bandwidth of 0.0625

The density plot shows how there is a relationship between the carat value and the cut quality, where for the ideal cut group the carat is mainly clustered below 0.5, in contrast the fair cut level presents the peak at 1, this concentration of data towards the value 1 is inversely proportional to the cut quality.

diamonds %>%
  filter(between(z, 2, 7)) %>%
  ggplot()+
  aes(cut, price, fill = cut(carat, 5))+
  geom_boxplot()

By subdividing the carat variable into 5 categories by intervals of equal rank, we can identify a group of diamonds with the lowest cut level (fair) whose carat values are grouped in the 4.05 - 5.01 interval.

3. Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
# install.packages("ggstance")
library(ggstance)

Attaching package: 'ggstance'
The following objects are masked from 'package:ggplot2':

    geom_errorbarh, GeomErrorbarh
ggplot(diamonds)+
  aes(cut, price, fill=color)+
  geom_boxplot()+
  coord_flip()
ggplot(diamonds)+
  aes(price, cut, fill=color)+
  ggstance::geom_boxploth()
Warning: The following aesthetics were dropped during statistical transformation: x.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?
Warning: Using the `size` aesthetic with geom_segment was deprecated in ggplot2 3.4.0.
ℹ Please use the `linewidth` aesthetic instead.
Warning: Using the `size` aesthetic with geom_polygon was deprecated in ggplot2 3.4.0.
ℹ Please use the `linewidth` aesthetic instead.

The main difference between using ggstance::geom_boxploth and ggplot2:: coord_flip() is the order in which the boxes are plotted, when performing coord_flip() in geom_boxplot() the boxes are sorted in ascending order according to category, thus first (bottom) the color = D is plotted and last in the same group color = J, in contrast the order of the categories makes much more sense when using geom_boxploth sorting from top to bottom, which is in accordance with the order of the legend plot.

4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
# install.packages("lvplot")
library(lvplot)

ggplot(diamonds)+
  aes(cut, price)+
  geom_boxplot()
ggplot(diamonds)+
  aes(cut, price)+
  lvplot::geom_lv()

This new letter-value plot represents more accurately the distribution of the data, by not only presenting a box for the 25th and 75th percentile interval but providing boxes for many more percentile ranges. This plot presents an intermediate point between a boxplot and a violin plot, being in boxes “discrete values” it can be more easily interpreted than a violin plot.

5. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
ggplot(diamonds)+
  aes(cut, price)+
  geom_violin()
ggplot(diamonds)+
  aes(price, fill = cut)+
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
ggplot(diamonds)+
  aes(price, color=cut)+
  geom_freqpoly(linewidth = 2)
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Although the three methods geom_violin(), geom_histogram(), and geom_frqpoly() are based on the same idea of representing the distribution of the data, in which we can also add geom_density(), they have different results depending on the objective. in the case of violin it allows us to see the distribution of each group independently and thus to identify where the data are grouped. In the case of the histogram, having overlapping groups makes it difficult to identify them, which can be useful to identify atypical groups. In the case of freqpoly, being lines allows to identify in a better way the differences between groups, mainly when the distributions overlap, unlike the violin that is not so easy to compare distributions between groups, and being lines in discrete values it is easier to interpret than the density plot with continuous values and smoothed lines.

6. If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
# install.packages("ggbeeswarm")
library(ggbeeswarm)

ggplot(iris)+
  aes(Species, Sepal.Length)+
  geom_jitter()
ggplot(iris)+
  aes(Species, Sepal.Length)+
  ggbeeswarm::geom_beeswarm()
ggplot(iris)+
  aes(Species, Sepal.Length)+
  ggbeeswarm::geom_quasirandom()

7.5.2.1 Exercises

1. How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?
diamonds %>% 
  count(color, cut) %>% 
  ggplot() +
  aes(x = color, y = cut, fill = n) +
  geom_tile()+
  scale_fill_gradient(trans = "log10")

we can perform a transformation with a logarithmic scale inside scale_fill_gradient(), this allows us to keep the original scale of the data but to better distribute the color palette among the data.

2. Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
flights %>% 
  mutate(month = factor(month, 1:12)) %>% 
  filter(arr_delay < 0) %>% 
  group_by(dest, month) %>% 
  summarise(mean_del = mean(arr_delay, na.rm=T)) %>% 
  ggplot()+
  aes(dest, month, fill = mean_del)+
  geom_tile()+
  scale_fill_viridis_c()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
        legend.position = "bottom")
`summarise()` has grouped output by 'dest'. You can override using the
`.groups` argument.

By having so many destinations, the graph becomes saturated, complicating the visualization and therefore its interpretability, since it is not possible to identify patterns or trends easily.

3. Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?
p1 = diamonds %>%
  count(color, cut) %>%
  ggplot() +
  aes(x = color, y = cut) +
  geom_tile(mapping = aes(fill = n))

p2 = diamonds %>%
  count(color, cut) %>%
  ggplot() +
  aes(x = cut, y = color) +
  geom_tile(mapping = aes(fill = n))

p1 + p2

In this case when plotting two categorical variables it could be indistinct or trivial the position of the variables in each axis, but given the way each of the categories are coded it is better to have aes(x = color, y = cut) rather than aes(x = cut, y = color) since putting cut in the y axis allows a cleaner reading since the category names are longer and color in x has no complication since it is a single letter.

7.5.3.1 Exercises

1. Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
smaller = diamonds %>% 
  filter(carat < 3)

p1 = ggplot(smaller, aes(carat, color = cut_number(carat, 6))) +
  geom_freqpoly()

p2 = ggplot(smaller, aes(carat, color = cut_width(carat, 0.52))) +
  geom_freqpoly()

p1 + p2
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Each of the two functions offers different results and depending on the objective it is convenient to use one or the other, on the one hand cut_number allows you to define the number of categories in which you want to convert the numerical variable, also this function respects the original range, minimum and maximum value, of the variable. on the other hand cut_width allows you to define the categories based on the range of each interval, for which the limits of the categories are extended to complete the range.

p1 = ggplot(data = smaller) +
  aes(x = carat, y = price)+
  geom_bin2d(mapping = aes(group = cut_number(carat, 6)), alpha=0.5)

p2 = ggplot(data = smaller) +
  aes(x = carat, y = price)+
  geom_bin2d(mapping = aes(group = cut_width(carat, 0.52)), alpha=0.5)

p1 + p2
`stat_bin2d()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin2d()` using `bins = 30`. Pick better value `binwidth`.

A conflict between cut_number and cut_width with geom_bin2d is that the categories generated by the graph for each polygon conflict with the categories of the new variable, so when using the alpha argument you see overlapping bands between groups, which means that you are double counting those values in each category.

2. Visualise the distribution of carat, partitioned by price.
p1 = ggplot(data = smaller) +
  aes(carat, cut_number(price, 10))+
  geom_boxplot()

p2 = ggplot(data = smaller) +
  aes(carat, cut_number(price, 10))+
  geom_density_ridges()

p1 + p2
Picking joint bandwidth of 0.02

3. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?
ggplot(data = smaller) +
  aes(price, cut_number(carat, 10))+
  geom_boxplot()

It can be observed that there is a relationship between size and price, but something that can be highlighted is how the price variance increases as the size increases, while smaller diamonds remain in a narrow price range but the range of large diamonds can vary up to 3 and 4 times.

4. Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.
ggplot(smaller)+
  aes(carat, price)+
  geom_density_2d(contour_var = "ndensity")+
  facet_wrap(~cut)

5. Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.
smaller = diamonds %>% 
  filter(x > 3, y < 20)
xy = smaller %>% 
  select(x, y)

mu = colMeans(xy)
s = cov(xy)
D = sqrt(mahalanobis(xy, mu, s))

p1 = smaller %>% 
  ggplot()+
  aes(x, y) + 
  geom_point(color = ifelse(D > 10, "red", "#333"),
             size = ifelse(D > 10, 2, 1))+
  geom_rug(length = unit(0.3, "cm"), color="#ccc")+
  coord_equal()+
  theme_bw()+
  theme(legend.position = "none")

p2 = ggMarginal(p1, type = "boxplot", size = 10, fill="#ccc")

p2

Step 3: Generate your own exercises