library(tidyverse)
library(ggExtra)
library(ggridges)
library(patchwork)
library(nycflights13)

Assignment 3 - Carlos Rivera (car808)
Step 2: Workbook exercises
7.3.4 Exercises
1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
diamonds %>%
  select(x:z) %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  ggplot()+
  aes(value, fill=key)+
  geom_histogram(binwidth = 0.1, alpha=0.6)+
  # geom_rug(aes(color=key))+
  geom_boxplot(width = 200)+
  facet_wrap(~key, scales = "free")+
  theme(legend.position = "none")

It can be seen that there are atypical (outlier) values in both y and z. Although you might think that most diamonds are completely circular when viewed from the top, in reality there may be slight variations, so x would represent the largest diameter and y the smallest diameter of the oval, and z would be the height of the diamond measured from the base to the top.
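A quick way to inspect those atypical rows directly (the thresholds below were read off the histograms by eye, so treat them as assumptions):

diamonds %>%
  filter(x == 0 | y == 0 | z == 0 | y > 20 | z > 10) %>%
  select(price, x, y, z)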
2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
p1 = ggplot(diamonds)+
aes(price)+
geom_histogram()
p2 = ggplot(diamonds)+
aes(price)+
geom_histogram(binwidth = 10)
p1 + p2

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
By changing the binwidth to a low value (binwidth = 10) it can be observed that there is a price range with no data (around $1,500), which splits the distribution in two: on the left, where most of the data accumulate with a pronounced peak, and after the empty interval a distribution with a long right tail, in which a slight secondary peak forms below $5,000.
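To confirm where the empty interval sits, one can zoom in on it with a fine binwidth (the window below is a guess based on the histogram, not a definitive range):

ggplot(diamonds)+
  aes(price)+
  geom_histogram(binwidth = 10)+
  coord_cartesian(xlim = c(1200, 1800))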
3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
c0.99 = sum(diamonds$carat == 0.99)
c1.00 = sum(diamonds$carat == 1.00)

There are 23 diamonds with carat = 0.99.
There are 1558 diamonds with carat = 1.00.
This is due to the tendency of carat values to accumulate at round one-decimal values, which may reflect the precision or convenience with which carats are determined: being so close to 1, it is preferred to round up to the higher value, thus generating a bias in the recorded values.
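A sketch that shows the rounding effect across the whole neighbourhood of 1 carat (the 0.9–1.1 window is an arbitrary choice):

diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  count(carat)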
4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
p0 = ggplot(diamonds)+
aes(price)+
geom_histogram()+
geom_vline(xintercept = c(2000,3000), color = "blue")
p1 = ggplot(diamonds)+
aes(price)+
geom_histogram()+
coord_cartesian(xlim = c(2000,3000))+
geom_vline(xintercept = c(2000,3000), color = "blue")
p2 = ggplot(diamonds)+
aes(price)+
geom_histogram()+
xlim(2000,3000)+
geom_vline(xintercept = c(2000,3000), color = "blue")
p0 / (p1 + p2)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 47807 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
Here we can observe that when using xlim inside coord_cartesian() the graph is only zoomed in, keeping all the parameters of the parent plot as far as the y scale and binwidth are concerned. In contrast, using the xlim() function directly to set the limits recalculates those parameters: the histogram is built from a dataset filtered to the desired limits (hence the warnings about removed rows), so the binwidth and y scale differ, and a bar that only partially overlaps the limits is dropped entirely rather than shown in half.
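A variant with the binwidth fixed explicitly (binwidth = 50 is an arbitrary choice) makes the contrast easier to see, since any remaining difference between the two panels can only come from xlim() filtering the data before binning:

p3 = ggplot(diamonds)+
  aes(price)+
  geom_histogram(binwidth = 50)+
  coord_cartesian(xlim = c(2000, 3000))
p4 = ggplot(diamonds)+
  aes(price)+
  geom_histogram(binwidth = 50)+
  xlim(2000, 3000)
p3 + p4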
7.4.1 Exercises
1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
set.seed(123)
x1 = diamonds$cut
x1[sample(length(x1), length(x1)*0.1)] = NA
x2 = diamonds$carat
x2[sample(length(x2), length(x2)*0.1)] = NA
p1 = ggplot() + aes(x1) + geom_bar()
p2 = ggplot() + aes(x2) + geom_histogram()
p1 + p2

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 5394 rows containing non-finite outside the scale range
(`stat_bin()`).
In the case of geom_bar(), when plotting a categorical variable, no warning message is returned and the resulting graph presents a bar for the new category NA. In the case of geom_histogram(), a warning is returned (Removed 5394 rows containing non-finite outside the scale range (`stat_bin()`)), indicating that to build the graph the data are first filtered to remove the NA values, and the histogram is then built with the non-NA values only. The difference is that NA is a meaningful level on a categorical axis, whereas there is no position on a continuous axis where an NA could be drawn.
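If we did want the missing carat values to appear in a plot, one simple option (an illustration, not the only approach) is to chart the missingness itself as a categorical variable:

ggplot() + aes(is.na(x2)) + geom_bar()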
2. What does na.rm = TRUE do in mean() and sum()?
mean(x2); mean(x2, na.rm = T)

[1] NA
[1] 0.7979873

sum(x2); sum(x2, na.rm = T)

[1] NA
[1] 38739.09
As can be seen in the example, for the two functions mean() and sum(), when used with their default arguments on a vector containing NA values, the return value is NA as well. Using the argument na.rm = TRUE means that the missing values (NA) are removed and the calculation is performed with the remaining finite values. In either case these functions return neither an error nor a warning in this situation. This behavior can also be observed in other functions of the base package, such as min(), max(), median(), sd(), and var().
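For instance, a quick check with two of those functions (same pattern: NA by default, a finite result with na.rm = TRUE):

min(x2); min(x2, na.rm = TRUE)
median(x2); median(x2, na.rm = TRUE)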
7.5.1.1 Exercises
3. Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
# install.packages("ggstance")
library(ggstance)
Attaching package: 'ggstance'
The following objects are masked from 'package:ggplot2':
geom_errorbarh, GeomErrorbarh
ggplot(diamonds)+
aes(cut, price, fill=color)+
geom_boxplot()+
coord_flip()
ggplot(diamonds)+
aes(price, cut, fill=color)+
  ggstance::geom_boxploth()

Warning: The following aesthetics were dropped during statistical transformation: x.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
Warning: Using the `size` aesthetic with geom_segment was deprecated in ggplot2 3.4.0.
ℹ Please use the `linewidth` aesthetic instead.
Warning: Using the `size` aesthetic with geom_polygon was deprecated in ggplot2 3.4.0.
ℹ Please use the `linewidth` aesthetic instead.
The main difference between ggstance::geom_boxploth() and coord_flip() is the order in which the boxes are plotted. When applying coord_flip() to geom_boxplot(), the boxes are sorted in ascending order by category, so within each group color = D is plotted first (at the bottom) and color = J last. In contrast, the order of the categories makes much more sense with geom_boxploth(), which sorts them from top to bottom, in accordance with the order of the plot legend.
4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
# install.packages("lvplot")
library(lvplot)
ggplot(diamonds)+
aes(cut, price)+
geom_boxplot()
ggplot(diamonds)+
aes(cut, price)+
  lvplot::geom_lv()

This new letter-value plot represents the distribution of the data more accurately: instead of presenting only a box for the 25th–75th percentile interval, it provides boxes for many more quantile ranges. The plot sits at an intermediate point between a boxplot and a violin plot; since its boxes are "discrete values", it can be interpreted more easily than a violin plot.
5. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
ggplot(diamonds)+
aes(cut, price)+
geom_violin()
ggplot(diamonds)+
aes(price, fill = cut)+
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds)+
aes(price, color=cut)+
  geom_freqpoly(linewidth = 2)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Although the three methods geom_violin(), geom_histogram(), and geom_freqpoly() are based on the same idea of representing the distribution of the data (to which we could also add geom_density()), they give different results depending on the objective. The violin lets us see the distribution of each group independently and thus identify where the data are concentrated. In the histogram, the stacked groups are harder to tell apart, although it can still be useful for spotting atypical groups. The freqpoly, being made of lines, makes it easier to identify differences between groups, mainly when the distributions overlap, unlike the violin, where comparing distributions between groups is not so easy; and being lines over discrete bins, it is easier to interpret than a density plot with continuous, smoothed values.
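Since geom_density() was mentioned as a fourth option, a minimal sketch of it for comparison:

ggplot(diamonds)+
  aes(price, color = cut)+
  geom_density()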
6. If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
# install.packages("ggbeeswarm")
library(ggbeeswarm)
ggplot(iris)+
aes(Species, Sepal.Length)+
geom_jitter()
ggplot(iris)+
aes(Species, Sepal.Length)+
ggbeeswarm::geom_beeswarm()
ggplot(iris)+
aes(Species, Sepal.Length)+
  ggbeeswarm::geom_quasirandom()

The two main alternatives to geom_jitter() are geom_beeswarm(), which offsets overlapping points sideways so that every observation remains visible (the "swarm" arrangement), and geom_quasirandom(), which spreads the points using a quasirandom (van der Corput) sequence, producing a jitter whose width reflects the local density; geom_quasirandom() also accepts several method options (e.g. "tukey", "smiley") that change the offsetting strategy.
7.5.2.1 Exercises
1. How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?
diamonds %>%
count(color, cut) %>%
ggplot() +
aes(x = color, y = cut, fill = n) +
geom_tile()+
  scale_fill_gradient(trans = "log10")

We can apply a logarithmic transformation to the fill scale inside scale_fill_gradient(); this keeps the original scale of the data while distributing the color palette more evenly across it.
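Another option, closer to what the question hints at, is to rescale the counts to proportions within each colour so that every column sums to 1 (a sketch; swap the group_by() variable to get colour within cut instead):

diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot() +
  aes(x = color, y = cut, fill = prop) +
  geom_tile()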
2. Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
flights %>%
mutate(month = factor(month, 1:12)) %>%
filter(arr_delay < 0) %>%
group_by(dest, month) %>%
summarise(mean_del = mean(arr_delay, na.rm=T)) %>%
ggplot()+
aes(dest, month, fill = mean_del)+
geom_tile()+
scale_fill_viridis_c()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
        legend.position = "bottom")

`summarise()` has grouped output by 'dest'. You can override using the `.groups` argument.
With so many destinations, the graph becomes saturated, complicating the visualization and therefore its interpretability, since it is not possible to identify patterns or trends easily. It could be improved by keeping only destinations with data for every month and by sorting the destinations (for example, by their mean delay) so that patterns become visible.
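One possible improvement (a sketch; the completeness filter and the ordering criterion are choices, not the only ones): keep only destinations with data for all 12 months and reorder them by their mean delay.

flights %>%
  group_by(dest) %>%
  filter(n_distinct(month) == 12) %>%
  mutate(month = factor(month, 1:12)) %>%
  group_by(dest, month) %>%
  summarise(mean_del = mean(arr_delay, na.rm = TRUE), .groups = "drop") %>%
  mutate(dest = fct_reorder(dest, mean_del)) %>%
  ggplot()+
  aes(dest, month, fill = mean_del)+
  geom_tile()+
  scale_fill_viridis_c()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
        legend.position = "bottom")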
3. Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?
p1 = diamonds %>%
count(color, cut) %>%
ggplot() +
aes(x = color, y = cut) +
geom_tile(mapping = aes(fill = n))
p2 = diamonds %>%
count(color, cut) %>%
ggplot() +
aes(x = cut, y = color) +
geom_tile(mapping = aes(fill = n))
p1 + p2

When plotting two categorical variables, the assignment of variables to axes might seem arbitrary, but given how the categories are labelled it is better to use aes(x = color, y = cut) than aes(x = cut, y = color): placing cut on the y axis gives a cleaner reading because its category names are longer, while color, being a single letter, causes no trouble on the x axis.
7.5.3.1 Exercises
1. Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
smaller = diamonds %>%
filter(carat < 3)
p1 = ggplot(smaller, aes(carat, color = cut_number(carat, 6))) +
geom_freqpoly()
p2 = ggplot(smaller, aes(carat, color = cut_width(carat, 0.52))) +
geom_freqpoly()
p1 + p2

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Each of the two functions offers different results, and depending on the objective it is convenient to use one or the other. On the one hand, cut_number() lets you define the number of categories into which to split the numeric variable, each containing roughly the same number of observations, and it respects the original range (minimum and maximum) of the variable. On the other hand, cut_width() lets you define the categories by the width of each interval, so the limits of the boundary categories are extended to complete the range.
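For the conditional view the exercise describes, price can go on the x axis with carat binned into the colour aesthetic (a sketch; binwidth = 500 is arbitrary):

ggplot(smaller)+
  aes(price, color = cut_number(carat, 6))+
  geom_freqpoly(binwidth = 500)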
p1 = ggplot(data = smaller) +
aes(x = carat, y = price)+
geom_bin2d(mapping = aes(group = cut_number(carat, 6)), alpha=0.5)
p2 = ggplot(data = smaller) +
aes(x = carat, y = price)+
geom_bin2d(mapping = aes(group = cut_width(carat, 0.52)), alpha=0.5)
p1 + p2

`stat_bin2d()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin2d()` using `bins = 30`. Pick better value with `binwidth`.
A conflict between cut_number()/cut_width() and geom_bin2d() is that the bins computed by the geom do not line up with the categories of the new grouping variable, so when the alpha argument is used you can see overlapping bands between groups, which means those values are effectively counted in more than one category.
2. Visualise the distribution of carat, partitioned by price.
p1 = ggplot(data = smaller) +
aes(carat, cut_number(price, 10))+
geom_boxplot()
p2 = ggplot(data = smaller) +
aes(carat, cut_number(price, 10))+
geom_density_ridges()
p1 + p2

Picking joint bandwidth of 0.02
3. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?
ggplot(data = smaller) +
aes(price, cut_number(carat, 10))+
  geom_boxplot()

It can be observed that there is a relationship between size and price, but something worth highlighting is how the price variance increases with size: smaller diamonds remain within a narrow price range, while the price of large diamonds can vary by a factor of 3 or 4.
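A quick numeric check of that widening spread, summarising the price dispersion per carat decile (IQR is one choice of spread measure):

smaller %>%
  group_by(size = cut_number(carat, 10)) %>%
  summarise(median_price = median(price), iqr_price = IQR(price))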
4. Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.
ggplot(smaller)+
aes(carat, price)+
geom_density_2d(contour_var = "ndensity")+
  facet_wrap(~cut)

5. Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.
smaller = diamonds %>%
filter(x > 3, y < 20)
# Flag multivariate outliers by their Mahalanobis distance from the
# centre of the (x, y) point cloud.
xy = smaller %>%
  select(x, y)
mu = colMeans(xy)                 # centre (mean of x and of y)
s = cov(xy)                       # covariance matrix of x and y
D = sqrt(mahalanobis(xy, mu, s))  # distance of each point from the centre
p1 = smaller %>%
ggplot()+
aes(x, y) +
  geom_point(color = ifelse(D > 10, "red", "#333333"),  # R needs 6-digit hex colors
             size = ifelse(D > 10, 2, 1))+
  geom_rug(length = unit(0.3, "cm"), color = "#cccccc")+
coord_equal()+
theme_bw()+
theme(legend.position = "none")
p2 = ggMarginal(p1, type = "boxplot", size = 10, fill = "#cccccc")
p2

Points with a Mahalanobis distance above 10 (an arbitrary threshold) are highlighted in red: their x and y values look unremarkable in the marginal boxplots, but the combination of the two places them far from the joint distribution.

Step 3: Generate your own exercises
Temporal and Spatial Trends in U.S. Border Crossings: Insights from Bureau of Transportation Statistics Data
Data Source: Border Crossing Entry Data
The Bureau of Transportation Statistics (BTS) Border Crossing Data offers summary statistics on inbound crossings at the U.S.-Canada and U.S.-Mexico borders at the port level. These data include trucks, trains, containers, buses, personal vehicles, passengers, and pedestrians. Collected by U.S. Customs and Border Protection (CBP) at ports of entry, the data show the number of vehicles, containers, passengers, or pedestrians entering the United States.
The first step is to review records that may have inconsistencies. In Figure 1, we identified two types of inconsistent records when comparing the number of vehicles with the number of passengers. The first occurs when the ratio between vehicles and passengers is less than 1, implying that more cars crossed than passengers. This is an unlikely scenario since the minimum should be an equal number of passengers and vehicles. The second problematic situation is when the ratio is very high; in our case, we set a threshold of 7, because most vehicles allow 5 passengers, but some allow up to 7. Ratios above this value indicate potential inconsistencies that should be filtered out.
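A minimal sketch of that consistency filter. The object crossings and the column and measure names (measure, value, "Personal Vehicles", "Personal Vehicle Passengers") are assumptions about the BTS file layout, not verified against the actual download:

# Hypothetical layout: one row per port, date, and measure.
crossings %>%
  filter(measure %in% c("Personal Vehicles", "Personal Vehicle Passengers")) %>%
  pivot_wider(names_from = measure, values_from = value) %>%
  mutate(ratio = `Personal Vehicle Passengers` / `Personal Vehicles`) %>%
  filter(ratio < 1 | ratio > 7)   # the inconsistent records to drop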
Figure 2 illustrates the monthly total of people entering the U.S. from Canada or Mexico, showing a declining trend over the past three decades. When aligned with presidential terms, a clear correlation appears. Two main points stand out: firstly, a sharp drop in entries during 2020, associated with the pandemic, followed by a slow recovery over the last five years. Secondly, an oscillating pattern is primarily seen at the Canadian border.
Examining the data more closely reveals that the oscillatory pattern noted earlier becomes clearer when analyzing monthly data distributions. Because the number of border crossers varies greatly, it’s more effective to express these figures as percentages of the annual total for each port of entry, rather than using absolute numbers. In Figure 3, there’s a noticeable correlation between the number of entrants and the time of year. Border crossings with Canada show a marked increase during the summer months, decreasing as winter nears. In contrast, crossings with Mexico tend to stay consistent throughout the year, with most months close to 1/12, except for December, which likely sees a rise due to Christmas and New Year, and February, which shows a sharp decline.
Finally, in Figure 4, the spatial distribution of the average annual number of border crossers by state is shown. Idaho has the lowest numbers, likely because of its short stretch of border with Canada. The highest figures are along the Mexican border, in California and Texas, each with over 30 million people entering the US annually.