Code
# Loading package(s)
library(tidyverse)
library(skimr)
library(nycflights13)
library(lvplot)
data(diamonds)
data(flights)Data Science 1 with R (STAT 301-1)
The goal of this lab is to begin to use data visualization and transformation skills to explore data in a systematic way. We call this exploratory data analysis (EDA).
# Loading package(s)
library(tidyverse)
library(skimr)
library(nycflights13)
library(lvplot)
data(diamonds)
data(flights)This lab utilizes the diamonds and flights datasets contained in packages ggplot2 and nycflights13, respectively. Documentation/codebook can be accessed with ?diamonds and ?flights, provided you have installed and loaded nycflights13 and tidyverse in your current R session.
There are a lot of resources out there to get help and insights into using RStudio. Please visit and explore the following:
Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond; think about how we might determine which dimension of a diamond is its length, width, and depth.
diamonds %>%
skim_without_charts(x, y, z)| Name | Piped data |
| Number of rows | 53940 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| x | 0 | 1 | 5.73 | 1.12 | 0 | 4.71 | 5.70 | 6.54 | 10.74 |
| y | 0 | 1 | 5.73 | 1.14 | 0 | 4.72 | 5.71 | 6.54 | 58.90 |
| z | 0 | 1 | 3.54 | 0.71 | 0 | 2.91 | 3.53 | 4.04 | 31.80 |
diamonds %>%
filter(x == 0)# A tibble: 8 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1.07 Ideal F SI2 61.6 56 4954 0 6.62 0
2 1 Very Good H VS2 63.3 53 5139 0 0 0
3 1.14 Fair G VS1 57.5 67 6381 0 0 0
4 1.56 Ideal G VS2 62.2 54 12800 0 0 0
5 1.2 Premium D VVS1 62.1 59 15686 0 0 0
6 2.25 Premium H SI2 62.8 59 18034 0 0 0
7 0.71 Good F SI2 64.1 60 2130 0 0 0
8 0.71 Good F SI2 64.1 60 2130 0 0 0
diamonds %>%
filter(y == 0)# A tibble: 7 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1 Very Good H VS2 63.3 53 5139 0 0 0
2 1.14 Fair G VS1 57.5 67 6381 0 0 0
3 1.56 Ideal G VS2 62.2 54 12800 0 0 0
4 1.2 Premium D VVS1 62.1 59 15686 0 0 0
5 2.25 Premium H SI2 62.8 59 18034 0 0 0
6 0.71 Good F SI2 64.1 60 2130 0 0 0
7 0.71 Good F SI2 64.1 60 2130 0 0 0
diamonds %>%
filter(z == 0)# A tibble: 20 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1 Premium G SI2 59.1 59 3142 6.55 6.48 0
2 1.01 Premium H I1 58.1 59 3167 6.66 6.6 0
3 1.1 Premium G SI2 63 59 3696 6.5 6.47 0
4 1.01 Premium F SI2 59.2 58 3837 6.5 6.47 0
5 1.5 Good G I1 64 61 4731 7.15 7.04 0
6 1.07 Ideal F SI2 61.6 56 4954 0 6.62 0
7 1 Very Good H VS2 63.3 53 5139 0 0 0
8 1.15 Ideal G VS2 59.2 56 5564 6.88 6.83 0
9 1.14 Fair G VS1 57.5 67 6381 0 0 0
10 2.18 Premium H SI2 59.4 61 12631 8.49 8.45 0
11 1.56 Ideal G VS2 62.2 54 12800 0 0 0
12 2.25 Premium I SI1 61.3 58 15397 8.52 8.42 0
13 1.2 Premium D VVS1 62.1 59 15686 0 0 0
14 2.2 Premium H SI1 61.2 59 17265 8.42 8.37 0
15 2.25 Premium H SI2 62.8 59 18034 0 0 0
16 2.02 Premium H VS2 62.7 53 18207 8.02 7.95 0
17 2.8 Good G SI2 63.8 58 18788 8.9 8.85 0
18 0.71 Good F SI2 64.1 60 2130 0 0 0
19 0.71 Good F SI2 64.1 60 2130 0 0 0
20 1.12 Premium G I1 60.4 59 2383 6.71 6.67 0
diamonds_adj <- diamonds %>%
mutate(z = if_else(z == 0, NA_real_, z), y = if_else(y == 0, NA_real_, y), x = if_else(x == 0, NA_real_, x))diamonds_adj %>%
select(x, y, z) %>%
summary() x y z
Min. : 3.730 Min. : 3.680 Min. : 1.07
1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.91
Median : 5.700 Median : 5.710 Median : 3.53
Mean : 5.732 Mean : 5.735 Mean : 3.54
3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.04
Max. :10.740 Max. :58.900 Max. :31.80
NA's :8 NA's :7 NA's :20
diamonds %>%
ggplot() +
geom_density(aes(x), color = "red", linetype = 1) +
geom_density(aes(y), color = "blue", linetype = 2) +
geom_density(aes(x), color = "green", linetype = 3) +
theme_minimal() +
ggtitle("Distributions of x, y, and z variables") +
xlim(c(0, 20))tidy_data <-
diamonds %>%
select(x, y, z) %>%
pivot_longer(cols = everything(), names_to = "measurement")
tidy_data %>%
ggplot(aes(x= value, y = measurement, fill = measurement)) + geom_boxplot(varwidth = TRUE) + coord_cartesian(xlim = c(0,10)) + theme_minimal()First off, after using the skim() function, we can see that all three variables have missing values, so we’ll translate them into NA values. From there, we can see that the z value is much less than the y and z measurements, which have nearly uniform distributions. Per the documentation for the diamonds dataset in RStudio, x is equal to the length, y is equal to the width, and z is equal to the depth. Should we not have that information, we would want to search the standard length, width, and depth for the specific diamonds (sorting by cuts). Certain diamonds have 1:1 length:width ratios, but when one is higher than the other, it is the width that increases, and depth would be less than both. Thus, since y > x > z, we’d be able to identify which variable was which.
Explore the distribution of price. Do you discover anything unusual or surprising? Hint: Carefully think about the binwidth and make sure you try a wide range of values.
diamonds %>%
ggplot(aes(x = color, y= price)) +
geom_boxplot(aes(fill = cut))diamonds %>%
filter(price < 4000) %>%
ggplot()+
geom_histogram(mapping = aes(x = price ), binwidth = 20)diamonds %>%
filter(price > 4000) %>%
ggplot()+
geom_histogram(mapping = aes(x = price ), binwidth = 60)With the standard deviation of price being nearly 4000, a bin width in that general area would seemingly allow for a neater visulation then one with a smaller bar width, but since we’re doing with such a wide range of values, filtering would be optimal. The greatest peak is apparent between $500 and $1000, with there surprisingly being no diamonds sold in the $1500 range. Among diamonds prices higher than the mean, most come between $4000 and $5000, but in general most diamonds sold fall under $1000, and it becomes even less common after the $2500 range. Thus, those priced beyond that certainly are of a coveted cut.
How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
diamonds %>%
filter(carat == .99) %>%
count()# A tibble: 1 × 1
n
<int>
1 23
diamonds %>%
filter(carat == 1) %>%
count()# A tibble: 1 × 1
n
<int>
1 1558
You would assume that the artificial “cut-off” of a 1 carat diamond makes it in more demand. Demand increases price, which increases incentive, so if the buyer wants a 1-carat diamond, the producer is going to want to assure that the diamonds that are being made meet that threshold. As such, this large disparity is not surprising, and there may be a “rounding-up” of 0.99 carat diamonds happening here in a similar fashion to what one may do with their height.
What is the major difference between using coord_cartesian() vs xlim() or ylim() to zoom in on a histogram/graphic?
If there are unusual values (outliers), coord_cartesian() will keep those values. On the other hand, xlim() and ylim() will simply throw away data outside the limits. Thus, if you want to see whether there are unusual values and where they are, coord_cartesian is the better argument to use.
What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
Missing values in the histogram are excluded, but, in the bar chart, they show up as N/A. This is because histogram is for continious variables, with bar charts being applicable for categorical variables.
What does na.rm = TRUE do in mean() and sum()? What happens when it is not included and NA values are present?
“na.rm = TRUE” removes N/A values from the dataset being used to calculate mean and sum. If not for that, the sum and mean values would show up as 0 for mean and N/A for sum.
Use what you’ve learned in 7.5.1 A Categorical and continuous variable to improve this visualization of the departure times of cancelled vs. non-cancelled flights.
flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(color = cancelled), binwidth = 1/4)flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(x = sched_dep_time, y = after_stat(density))) +
geom_freqpoly(mapping = aes(color = cancelled), binwidth = 1/4)flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(x= cancelled, y = sched_dep_time)) + geom_boxplot()There are two possible ways to improve this data visualization. One way would be to switch what is being measured on the y-axis from count to density, which is standardized so that the area under each frequency polygon is one. Furthermore, a box plot is often seen as a better visualization for a continious variable being compared to a categorical variable, as it’s very easy to see the distribution of the scheduled departure time for cancelled flights verus non-cancelled flights; cancelled flights generally had a later scheduled departure time, which is not surprising.
What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships result in lower quality diamonds being more expensive?
diamonds %>%
ggplot(aes(x= color, y = price)) + geom_boxplot()diamonds %>%
ggplot(mapping = aes(x= clarity, y = price)) + geom_boxplot()diamonds %>%
ggplot(mapping = aes(x= cut, y = price)) + geom_boxplot()diamonds %>%
ggplot(mapping = aes(x= carat, y = price)) + geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))diamonds %>%
ggplot(mapping = aes(x= cut, y = carat)) + geom_boxplot()Carat is the variable most important for predicting the price of a diamond, with a clear positive relationship. There isn’t much of a clear relationship between cut and carat, with cuts deemed as “fair” actually having the most carats. Thus, as you’d expect, there’s also not much a relationship between price and cut, and since “fair” cuts seemed to have the most carats, those end up being the ones with the highest median price; the focus is clearly with how many carats are on the diamond.
Recreate the horizontal boxplot below without using coord_flip().
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
labs(
x = NULL,
y = NULL
) +
coord_flip()ggplot(data = mpg) +
geom_boxplot(mapping = aes(y = reorder(class, hwy, FUN = median), x = hwy)) +
labs(
x = NULL,
y = NULL
)Manually adjusting the x and y axis accomplishes the same goal as coord_flip().
One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values.” One approach to remedy this problem is the letter value plot. Install the lvplot package and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots? Check out this paper
ggplot(diamonds) +
geom_lv(mapping = aes(x = cut, y = price))Lv plots are more useful for larger datasets because they provide more than a five-number summary, have more accuracy, and deal with outliers much better; larger datasets are going to be more likely to have more unusual values.The distribution of price for the fair cut is condensed much lower than that of the premium cut, for instance, which would be expected based on the general relationship between cut and price.
Display the relationship between price and cut using geom_violin() and again using a faceted version of geom_histogram(). What are the pros and cons of each method?
diamonds %>%
ggplot(aes(x = cut, y= price)) +
geom_violin(trim = FALSE)diamonds %>%
ggplot(aes(x = price)) +
geom_histogram() +
facet_wrap(~ cut, ncol = 1, scales = "free_y")The violin plot and faceted version of the histogram make it very easy to see the shape of the distribution between cut and price. However, one issue with the faceted histogram is that the scales of price differ greatly based on the cut, which can make it very difficult to understand at first glance. The violin plot, meanwhile, can struggle to demonstrate the comparison of the different cuts, as it’s more difficult to understand the median price of each cut. For shape, these well, but they have notable flaws for comparison purposes.
Can you rescale the data used in the plot below to more clearly show the distribution of cut within color, or color within cut? Starting from the code below, use rescaling to show the distribution of (a) cut within color and (b) color within cut.
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))#joint distribution
diamonds %>%
count(color, cut) %>%
mutate(prop = n / sum(n)) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop))#cut within color
diamonds %>%
count(color, cut) %>%
group_by(color) %>%
mutate(prop = n / sum(n)) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop)) +
ggtitle("Dist of cut within color")#color within cut
diamonds %>%
count(color, cut) %>%
group_by(cut) %>%
mutate(prop = n / sum(n)) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop)) +
ggtitle("Dist of color within cut")The first plot is a joint distribution between cut and color, with the latter two showing the distribution of cut within color and color within cut.
In Exercise 13, why is it slightly better to use aes(x = color, y = cut) instead of aes(x = cut, y = color)?
Color has more variables, and it’s easier to read a plot that is wider as opposed to narrower. Additionally, with color on x-axis, it is easier to see that J colors are less common. Also, the top colors are at the top, which is generally more visually easy to understand, since the count is mainly correlated with cut in the joint distribution.
Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot hard to read? Improve the plot.
flights %>%
group_by(month, dest) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = factor(month), y = dest)) +
geom_tile(aes(fill = avg_dep_delay)) +
labs( x= "Month", y = "Destination", fill = "Average Departure Delay") +
ggtitle("Average departure delay by month and destination")flights %>%
group_by(dest) %>%
filter(n() > 100) %>%
ungroup() %>%
group_by(month, dest) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = factor(month), y = dest)) +
geom_tile(aes(fill = avg_dep_delay)) +
labs( x= "Month", y = "Destination", fill = "Average Departure Delay") +
ggtitle("Average departure delay by month and destination")flights %>%
group_by(dest) %>%
filter(n() > 100) %>%
ungroup() %>%
group_by(month, dest) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(y = factor(month), x = dest)) +
geom_tile(aes(fill = avg_dep_delay)) +
labs( x= "Destination", y = "Month", fill = "Average Departure Delay") +
ggtitle("Average departure delay by month and destination")flights %>%
group_by(dest) %>%
filter(n() > 100) %>%
ungroup() %>%
group_by(month, dest) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ungroup() %>%
filter(dest %in% c("ORD", "ABQ", "HOU", "SFO")) %>%
ggplot(aes(y = factor(month), x = dest)) +
geom_tile(aes(fill = avg_dep_delay)) +
labs( x= "Destination", y = "Month", fill = "Average Departure Delay") +
ggtitle("Average departure delay by month and destination")Since there are so many different destinations, the geom_tile plot is very difficult to read. To improve this, we can try to filter out destinations + month combinations that are infrequent combinations, while also filtering to the specific destinations were are interested in.
Use the smaller dataset defined below for this exercise. Visualize the distribution of carat broken down by price. Construct 2 graphics, one using cut_width() and the other using cut_number().
smaller <- diamonds %>%
filter(carat < 3)smaller <- diamonds %>%
filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))smaller <- diamonds %>%
filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))How does the distribution of price differ for very large diamonds (carat \(\ge\) 3)? Is this result surprising? Why or why not?
larger <- diamonds %>%
filter(carat >= 3)
ggplot(data = larger, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, .5)))larger <- diamonds %>%
filter(carat >= 3)
ggplot(data = larger, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, 5)))The distribution for larger diamonds has more variability, which isn’t surprising since the sample size is much smaller. Generally, you want to look at data with a large enough sample size to decrease the overall statistical noise and make the information more reliable, as splits like this can be slightly deceiving. That being said, since larger diamonds have generally are of greater prices, there is also going to be a higher deviation between two diamonds in the larger range. These two factors leads to a much less stable data distribution.
Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.
diamonds %>%
group_by(carat) %>%
filter(n() > 10) %>%
ungroup() %>%
group_by(cut, carat) %>%
summarise(avg_price = mean(price, na.rm = TRUE)) %>%
ggplot(aes(y = cut, x = carat)) +
geom_tile(aes(fill = avg_price)) +
labs(x= "Carat", y = "Cut", fill = "Average Price") +
ggtitle("Average price by cut and carat")diamonds %>%
ggplot(aes(x= carat, y = price, fill = cut)) + geom_boxplot()