library(tidyverse)
library(ggExtra)
library(ggridges)
library(patchwork)
library(nycflights13)

Assignment 3 - Carlos Rivera (car808)
Step 2: Workbook exercises
7.3.4 Exercises
1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
diamonds %>%
  select(x:z) %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  ggplot()+
  aes(value, fill=key)+
  geom_histogram(binwidth = 0.1, alpha=0.6)+
  # geom_rug(aes(color=key))+
  geom_boxplot(width = 200)+
  facet_wrap(~key, scales = "free")+
  theme(legend.position = "none")

It can be seen that there are atypical (outlier) values in both y and z. Although you might think that most diamonds are completely circular when viewed from the top, in reality there may be slight variations, so x would represent the largest diameter and y the smallest diameter of the oval, and z would be the height of the diamond measured from the base to the top.
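A quick way to inspect those atypical rows directly (the thresholds below were read off the histograms by eye, so treat them as assumptions):

diamonds %>%
  filter(x == 0 | y == 0 | z == 0 | y > 20 | z > 10) %>%
  select(price, x, y, z)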
2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
p1 = ggplot(diamonds)+
aes(price)+
geom_histogram()
p2 = ggplot(diamonds)+
aes(price)+
geom_histogram(binwidth = 10)
p1 + p2

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
By changing the binwidth to a low value (binwidth = 10) it can be observed that there is a price range with no data (around $1,500), which splits the distribution in two: on the left, where most of the data accumulate with a pronounced peak, and after the empty interval a distribution with a long right tail, in which a slight secondary peak forms below $5,000.
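To confirm where the empty interval sits, one can zoom in on it with a fine binwidth (the window below is a guess based on the histogram, not a definitive range):

ggplot(diamonds)+
  aes(price)+
  geom_histogram(binwidth = 10)+
  coord_cartesian(xlim = c(1200, 1800))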
3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
c0.99 = sum(diamonds$carat == 0.99)
c1.00 = sum(diamonds$carat == 1.00)

There are 23 diamonds with carat = 0.99.
There are 1558 diamonds with carat = 1.00.
This is due to the tendency of carat values to accumulate at round one-decimal values, which may reflect the precision or convenience with which carats are determined: being so close to 1, it is preferred to round up to the higher value, thus generating a bias in the recorded values.
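A sketch that shows the rounding effect across the whole neighbourhood of 1 carat (the 0.9–1.1 window is an arbitrary choice):

diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  count(carat)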
4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
p0 = ggplot(diamonds)+
aes(price)+
geom_histogram()+
geom_vline(xintercept = c(2000,3000), color = "blue")
p1 = ggplot(diamonds)+
aes(price)+
geom_histogram()+
coord_cartesian(xlim = c(2000,3000))+
geom_vline(xintercept = c(2000,3000), color = "blue")
p2 = ggplot(diamonds)+
aes(price)+
geom_histogram()+
xlim(2000,3000)+
geom_vline(xintercept = c(2000,3000), color = "blue")
p0 / (p1 + p2)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 47807 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
Here we can observe that when using xlim inside coord_cartesian() the graph is only zoomed in, keeping all the parameters of the parent plot as far as the y scale and binwidth are concerned. In contrast, using the xlim() function directly to set the limits recalculates those parameters: the histogram is built from a dataset filtered to the desired limits (hence the warnings about removed rows), so the binwidth and y scale differ, and a bar that only partially overlaps the limits is dropped entirely rather than shown in half.
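A variant with the binwidth fixed explicitly (binwidth = 50 is an arbitrary choice) makes the contrast easier to see, since any remaining difference between the two panels can only come from xlim() filtering the data before binning:

p3 = ggplot(diamonds)+
  aes(price)+
  geom_histogram(binwidth = 50)+
  coord_cartesian(xlim = c(2000, 3000))
p4 = ggplot(diamonds)+
  aes(price)+
  geom_histogram(binwidth = 50)+
  xlim(2000, 3000)
p3 + p4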
7.4.1 Exercises
1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
set.seed(123)
x1 = diamonds$cut
x1[sample(length(x1), length(x1)*0.1)] = NA
x2 = diamonds$carat
x2[sample(length(x2), length(x2)*0.1)] = NA
p1 = ggplot() + aes(x1) + geom_bar()
p2 = ggplot() + aes(x2) + geom_histogram()
p1 + p2

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 5394 rows containing non-finite outside the scale range
(`stat_bin()`).
In the case of geom_bar(), when plotting a categorical variable, no warning message is returned and the resulting graph presents a bar for the new category NA. In the case of geom_histogram(), a warning is returned (Removed 5394 rows containing non-finite outside the scale range (`stat_bin()`)), indicating that to build the graph the data are first filtered to remove the NA values, and the histogram is then built with the non-NA values only. The difference is that NA is a meaningful level on a categorical axis, whereas there is no position on a continuous axis where an NA could be drawn.
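If we did want the missing carat values to appear in a plot, one simple option (an illustration, not the only approach) is to chart the missingness itself as a categorical variable:

ggplot() + aes(is.na(x2)) + geom_bar()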
2. What does na.rm = TRUE do in mean() and sum()?
mean(x2); mean(x2, na.rm = T)

[1] NA
[1] 0.7979873

sum(x2); sum(x2, na.rm = T)

[1] NA
[1] 38739.09
As can be seen in the example, for the two functions mean() and sum(), when used with their default arguments on a vector containing NA values, the return value is NA as well. Using the argument na.rm = TRUE means that the missing values (NA) are removed and the calculation is performed with the remaining finite values. In either case these functions return neither an error nor a warning in this situation. This behavior can also be observed in other functions of the base package, such as min(), max(), median(), sd(), and var().
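For instance, a quick check with two of those functions (same pattern: NA by default, a finite result with na.rm = TRUE):

min(x2); min(x2, na.rm = TRUE)
median(x2); median(x2, na.rm = TRUE)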
7.5.1.1 Exercises
3. Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
# install.packages("ggstance")
library(ggstance)
Attaching package: 'ggstance'
The following objects are masked from 'package:ggplot2':
geom_errorbarh, GeomErrorbarh
ggplot(diamonds)+
aes(cut, price, fill=color)+
geom_boxplot()+
coord_flip()
ggplot(diamonds)+
aes(price, cut, fill=color)+
  ggstance::geom_boxploth()

Warning: The following aesthetics were dropped during statistical transformation: x.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
Warning: Using the `size` aesthetic with geom_segment was deprecated in ggplot2 3.4.0.
ℹ Please use the `linewidth` aesthetic instead.
Warning: Using the `size` aesthetic with geom_polygon was deprecated in ggplot2 3.4.0.
ℹ Please use the `linewidth` aesthetic instead.
The main difference between ggstance::geom_boxploth() and coord_flip() is the order in which the boxes are plotted. When applying coord_flip() to geom_boxplot(), the boxes are sorted in ascending order by category, so within each group color = D is plotted first (at the bottom) and color = J last. In contrast, the order of the categories makes much more sense with geom_boxploth(), which sorts them from top to bottom, in accordance with the order of the plot legend.
4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
# install.packages("lvplot")
library(lvplot)
ggplot(diamonds)+
aes(cut, price)+
geom_boxplot()
ggplot(diamonds)+
aes(cut, price)+
  lvplot::geom_lv()

This new letter-value plot represents the distribution of the data more accurately: instead of presenting only a box for the 25th–75th percentile interval, it provides boxes for many more quantile ranges. The plot sits at an intermediate point between a boxplot and a violin plot; since its boxes are "discrete values", it can be interpreted more easily than a violin plot.
5. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
ggplot(diamonds)+
aes(cut, price)+
geom_violin()
ggplot(diamonds)+
aes(price, fill = cut)+
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds)+
aes(price, color=cut)+
  geom_freqpoly(linewidth = 2)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Although the three methods geom_violin(), geom_histogram(), and geom_freqpoly() are based on the same idea of representing the distribution of the data (to which we could also add geom_density()), they give different results depending on the objective. The violin lets us see the distribution of each group independently and thus identify where the data are concentrated. In the histogram, the stacked groups are harder to tell apart, although it can still be useful for spotting atypical groups. The freqpoly, being made of lines, makes it easier to identify differences between groups, mainly when the distributions overlap, unlike the violin, where comparing distributions between groups is not so easy; and being lines over discrete bins, it is easier to interpret than a density plot with continuous, smoothed values.
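Since geom_density() was mentioned as a fourth option, a minimal sketch of it for comparison:

ggplot(diamonds)+
  aes(price, color = cut)+
  geom_density()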
6. If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
# install.packages("ggbeeswarm")
library(ggbeeswarm)
ggplot(iris)+
aes(Species, Sepal.Length)+
geom_jitter()
ggplot(iris)+
aes(Species, Sepal.Length)+
ggbeeswarm::geom_beeswarm()
ggplot(iris)+
aes(Species, Sepal.Length)+
  ggbeeswarm::geom_quasirandom()

The two main alternatives to geom_jitter() are geom_beeswarm(), which offsets overlapping points sideways so that every observation remains visible (the "swarm" arrangement), and geom_quasirandom(), which spreads the points using a quasirandom (van der Corput) sequence, producing a jitter whose width reflects the local density; geom_quasirandom() also accepts several method options (e.g. "tukey", "smiley") that change the offsetting strategy.
7.5.2.1 Exercises
1. How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?
diamonds %>%
count(color, cut) %>%
ggplot() +
aes(x = color, y = cut, fill = n) +
geom_tile()+
  scale_fill_gradient(trans = "log10")

We can apply a logarithmic transformation to the fill scale inside scale_fill_gradient(); this keeps the original scale of the data while distributing the color palette more evenly across it.
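Another option, closer to what the question hints at, is to rescale the counts to proportions within each colour so that every column sums to 1 (a sketch; swap the group_by() variable to get colour within cut instead):

diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot() +
  aes(x = color, y = cut, fill = prop) +
  geom_tile()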
2. Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
flights %>%
mutate(month = factor(month, 1:12)) %>%
filter(arr_delay < 0) %>%
group_by(dest, month) %>%
summarise(mean_del = mean(arr_delay, na.rm=T)) %>%
ggplot()+
aes(dest, month, fill = mean_del)+
geom_tile()+
scale_fill_viridis_c()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
        legend.position = "bottom")

`summarise()` has grouped output by 'dest'. You can override using the `.groups` argument.
With so many destinations, the graph becomes saturated, complicating the visualization and therefore its interpretability, since it is not possible to identify patterns or trends easily. It could be improved by keeping only destinations with data for every month and by sorting the destinations (for example, by their mean delay) so that patterns become visible.
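One possible improvement (a sketch; the completeness filter and the ordering criterion are choices, not the only ones): keep only destinations with data for all 12 months and reorder them by their mean delay.

flights %>%
  group_by(dest) %>%
  filter(n_distinct(month) == 12) %>%
  mutate(month = factor(month, 1:12)) %>%
  group_by(dest, month) %>%
  summarise(mean_del = mean(arr_delay, na.rm = TRUE), .groups = "drop") %>%
  mutate(dest = fct_reorder(dest, mean_del)) %>%
  ggplot()+
  aes(dest, month, fill = mean_del)+
  geom_tile()+
  scale_fill_viridis_c()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
        legend.position = "bottom")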
3. Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?
p1 = diamonds %>%
count(color, cut) %>%
ggplot() +
aes(x = color, y = cut) +
geom_tile(mapping = aes(fill = n))
p2 = diamonds %>%
count(color, cut) %>%
ggplot() +
aes(x = cut, y = color) +
geom_tile(mapping = aes(fill = n))
p1 + p2

When plotting two categorical variables, the assignment of variables to axes might seem arbitrary, but given how the categories are labelled it is better to use aes(x = color, y = cut) than aes(x = cut, y = color): placing cut on the y axis gives a cleaner reading because its category names are longer, while color, being a single letter, causes no trouble on the x axis.
7.5.3.1 Exercises
1. Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
smaller = diamonds %>%
filter(carat < 3)
p1 = ggplot(smaller, aes(carat, color = cut_number(carat, 6))) +
geom_freqpoly()
p2 = ggplot(smaller, aes(carat, color = cut_width(carat, 0.52))) +
geom_freqpoly()
p1 + p2

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Each of the two functions offers different results, and depending on the objective it is convenient to use one or the other. On the one hand, cut_number() lets you define the number of categories into which to split the numeric variable, each containing roughly the same number of observations, and it respects the original range (minimum and maximum) of the variable. On the other hand, cut_width() lets you define the categories by the width of each interval, so the limits of the boundary categories are extended to complete the range.
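For the conditional view the exercise describes, price can go on the x axis with carat binned into the colour aesthetic (a sketch; binwidth = 500 is arbitrary):

ggplot(smaller)+
  aes(price, color = cut_number(carat, 6))+
  geom_freqpoly(binwidth = 500)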
p1 = ggplot(data = smaller) +
aes(x = carat, y = price)+
geom_bin2d(mapping = aes(group = cut_number(carat, 6)), alpha=0.5)
p2 = ggplot(data = smaller) +
aes(x = carat, y = price)+
geom_bin2d(mapping = aes(group = cut_width(carat, 0.52)), alpha=0.5)
p1 + p2

`stat_bin2d()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin2d()` using `bins = 30`. Pick better value with `binwidth`.
A conflict between cut_number()/cut_width() and geom_bin2d() is that the bins computed by the geom do not line up with the categories of the new grouping variable, so when the alpha argument is used you can see overlapping bands between groups, which means those values are effectively counted in more than one category.
2. Visualise the distribution of carat, partitioned by price.
p1 = ggplot(data = smaller) +
aes(carat, cut_number(price, 10))+
geom_boxplot()
p2 = ggplot(data = smaller) +
aes(carat, cut_number(price, 10))+
geom_density_ridges()
p1 + p2

Picking joint bandwidth of 0.02
3. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?
ggplot(data = smaller) +
aes(price, cut_number(carat, 10))+
  geom_boxplot()

It can be observed that there is a relationship between size and price, but something worth highlighting is how the price variance increases with size: smaller diamonds remain within a narrow price range, while the price of large diamonds can vary by a factor of 3 or 4.
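A quick numeric check of that widening spread, summarising the price dispersion per carat decile (IQR is one choice of spread measure):

smaller %>%
  group_by(size = cut_number(carat, 10)) %>%
  summarise(median_price = median(price), iqr_price = IQR(price))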
4. Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.
ggplot(smaller)+
aes(carat, price)+
geom_density_2d(contour_var = "ndensity")+
  facet_wrap(~cut)

5. Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.
smaller = diamonds %>%
filter(x > 3, y < 20)
# Flag multivariate outliers by their Mahalanobis distance from the
# centre of the (x, y) point cloud.
xy = smaller %>%
  select(x, y)
mu = colMeans(xy)                 # centre (mean of x and of y)
s = cov(xy)                       # covariance matrix of x and y
D = sqrt(mahalanobis(xy, mu, s))  # distance of each point from the centre
p1 = smaller %>%
ggplot()+
aes(x, y) +
  geom_point(color = ifelse(D > 10, "red", "#333333"),  # R needs 6-digit hex colors
             size = ifelse(D > 10, 2, 1))+
  geom_rug(length = unit(0.3, "cm"), color = "#cccccc")+
coord_equal()+
theme_bw()+
theme(legend.position = "none")
p2 = ggMarginal(p1, type = "boxplot", size = 10, fill = "#cccccc")
p2

Points with a Mahalanobis distance above 10 (an arbitrary threshold) are highlighted in red: their x and y values look unremarkable in the marginal boxplots, but the combination of the two places them far from the joint distribution.

Step 3: Generate your own exercises
Temporal and Spatial Trends in U.S. Border Crossings: Insights from Bureau of Transportation Statistics Data
Data Source: Border Crossing Entry Data
The Bureau of Transportation Statistics (BTS) Border Crossing Data offers summary statistics on inbound crossings at the U.S.-Canada and U.S.-Mexico borders at the port level. These data include trucks, trains, containers, buses, personal vehicles, passengers, and pedestrians. Collected by U.S. Customs and Border Protection (CBP) at ports of entry, the data show the number of vehicles, containers, passengers, or pedestrians entering the United States.
The first step is to review records that may have inconsistencies. In Figure 1, we identified two types of inconsistent records when comparing the number of vehicles with the number of passengers. The first occurs when the ratio between vehicles and passengers is less than 1, implying that more cars crossed than passengers. This is an unlikely scenario since the minimum should be an equal number of passengers and vehicles. The second problematic situation is when the ratio is very high; in our case, we set a threshold of 7, because most vehicles allow 5 passengers, but some allow up to 7. Ratios above this value indicate potential inconsistencies that should be filtered out.
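A minimal sketch of that consistency filter. The object crossings and the column and measure names (measure, value, "Personal Vehicles", "Personal Vehicle Passengers") are assumptions about the BTS file layout, not verified against the actual download:

# Hypothetical layout: one row per port, date, and measure.
crossings %>%
  filter(measure %in% c("Personal Vehicles", "Personal Vehicle Passengers")) %>%
  pivot_wider(names_from = measure, values_from = value) %>%
  mutate(ratio = `Personal Vehicle Passengers` / `Personal Vehicles`) %>%
  filter(ratio < 1 | ratio > 7)   # the inconsistent records to drop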
Figure 2 illustrates the monthly total of people entering the U.S. from Canada or Mexico, showing a declining trend over the past three decades. When aligned with presidential terms, a clear correlation appears. Two main points stand out: firstly, a sharp drop in entries during 2020, associated with the pandemic, followed by a slow recovery over the last five years. Secondly, an oscillating pattern is primarily seen at the Canadian border.
Examining the data more closely reveals that the oscillatory pattern noted earlier becomes clearer when analyzing monthly data distributions. Because the number of border crossers varies greatly, it’s more effective to express these figures as percentages of the annual total for each port of entry, rather than using absolute numbers. In Figure 3, there’s a noticeable correlation between the number of entrants and the time of year. Border crossings with Canada show a marked increase during the summer months, decreasing as winter nears. In contrast, crossings with Mexico tend to stay consistent throughout the year, with most months close to 1/12, except for December, which likely sees a rise due to Christmas and New Year, and February, which shows a sharp decline.
Finally, in Figure 4, the spatial distribution of the average annual number of border crossers by state is shown. Idaho has the lowest numbers, likely because of its short stretch of border with Canada. The highest figures are along the Mexican border, in California and Texas, each with over 30 million people entering the US annually.