L04 Exploratory Data Analysis

Data Science 1 with R (STAT 301-1)

Author

Justin Dunbar

Overview

The goal of this lab is to begin to use data visualization and transformation skills to explore data in a systematic way. We call this exploratory data analysis (EDA).

Load Packages

Code
# Loading package(s)
library(tidyverse)
library(skimr)
library(nycflights13)
library(lvplot)
data(diamonds)
data(flights)

Datasets

This lab utilizes the diamonds and flights datasets contained in packages ggplot2 and nycflights13, respectively. Documentation/codebook can be accessed with ?diamonds and ?flights, provided you have installed and loaded nycflights13 and tidyverse in your current R session.

Exercises

Exercise 1

There are a lot of resources out there to get help and insights into using RStudio. Please visit and explore the following:

Done

Exercise 2

Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond; think about how we might determine which dimension of a diamond is its length, width, and depth.

Code
diamonds %>%
  skim_without_charts(x, y, z)
Data summary
Name Piped data
Number of rows 53940
Number of columns 10
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
x 0 1 5.73 1.12 0 4.71 5.70 6.54 10.74
y 0 1 5.73 1.14 0 4.72 5.71 6.54 58.90
z 0 1 3.54 0.71 0 2.91 3.53 4.04 31.80
Code
diamonds %>%
  filter(x == 0)
# A tibble: 8 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  1.07 Ideal     F     SI2      61.6    56  4954     0  6.62     0
2  1    Very Good H     VS2      63.3    53  5139     0  0        0
3  1.14 Fair      G     VS1      57.5    67  6381     0  0        0
4  1.56 Ideal     G     VS2      62.2    54 12800     0  0        0
5  1.2  Premium   D     VVS1     62.1    59 15686     0  0        0
6  2.25 Premium   H     SI2      62.8    59 18034     0  0        0
7  0.71 Good      F     SI2      64.1    60  2130     0  0        0
8  0.71 Good      F     SI2      64.1    60  2130     0  0        0
Code
diamonds %>%
  filter(y == 0)
# A tibble: 7 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  1    Very Good H     VS2      63.3    53  5139     0     0     0
2  1.14 Fair      G     VS1      57.5    67  6381     0     0     0
3  1.56 Ideal     G     VS2      62.2    54 12800     0     0     0
4  1.2  Premium   D     VVS1     62.1    59 15686     0     0     0
5  2.25 Premium   H     SI2      62.8    59 18034     0     0     0
6  0.71 Good      F     SI2      64.1    60  2130     0     0     0
7  0.71 Good      F     SI2      64.1    60  2130     0     0     0
Code
diamonds %>%
  filter(z == 0)
# A tibble: 20 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1    Premium   G     SI2      59.1    59  3142  6.55  6.48     0
 2  1.01 Premium   H     I1       58.1    59  3167  6.66  6.6      0
 3  1.1  Premium   G     SI2      63      59  3696  6.5   6.47     0
 4  1.01 Premium   F     SI2      59.2    58  3837  6.5   6.47     0
 5  1.5  Good      G     I1       64      61  4731  7.15  7.04     0
 6  1.07 Ideal     F     SI2      61.6    56  4954  0     6.62     0
 7  1    Very Good H     VS2      63.3    53  5139  0     0        0
 8  1.15 Ideal     G     VS2      59.2    56  5564  6.88  6.83     0
 9  1.14 Fair      G     VS1      57.5    67  6381  0     0        0
10  2.18 Premium   H     SI2      59.4    61 12631  8.49  8.45     0
11  1.56 Ideal     G     VS2      62.2    54 12800  0     0        0
12  2.25 Premium   I     SI1      61.3    58 15397  8.52  8.42     0
13  1.2  Premium   D     VVS1     62.1    59 15686  0     0        0
14  2.2  Premium   H     SI1      61.2    59 17265  8.42  8.37     0
15  2.25 Premium   H     SI2      62.8    59 18034  0     0        0
16  2.02 Premium   H     VS2      62.7    53 18207  8.02  7.95     0
17  2.8  Good      G     SI2      63.8    58 18788  8.9   8.85     0
18  0.71 Good      F     SI2      64.1    60  2130  0     0        0
19  0.71 Good      F     SI2      64.1    60  2130  0     0        0
20  1.12 Premium   G     I1       60.4    59  2383  6.71  6.67     0
Code
diamonds_adj <- diamonds %>%
  mutate(z = if_else(z == 0, NA_real_, z), y = if_else(y == 0, NA_real_, y), x = if_else(x == 0, NA_real_, x))
Code
diamonds_adj %>%
  select(x, y, z) %>%
summary()
       x                y                z        
 Min.   : 3.730   Min.   : 3.680   Min.   : 1.07  
 1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.91  
 Median : 5.700   Median : 5.710   Median : 3.53  
 Mean   : 5.732   Mean   : 5.735   Mean   : 3.54  
 3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.04  
 Max.   :10.740   Max.   :58.900   Max.   :31.80  
 NA's   :8        NA's   :7        NA's   :20     
Code
diamonds %>%
ggplot() +
  geom_density(aes(x), color = "red", linetype = 1) +
  geom_density(aes(y), color = "blue", linetype = 2) + 
  geom_density(aes(x), color = "green", linetype = 3) +
    theme_minimal() +
    ggtitle("Distributions of x, y, and z variables") +
    xlim(c(0, 20))

Code
tidy_data <-
diamonds %>%
  select(x, y, z) %>%
  pivot_longer(cols = everything(), names_to = "measurement")

tidy_data %>%
ggplot(aes(x= value, y = measurement, fill = measurement)) + geom_boxplot(varwidth = TRUE) + coord_cartesian(xlim = c(0,10)) + theme_minimal()

Solution

First off, after using the skim() function, we can see that all three variables have missing values, so we’ll translate them into NA values. From there, we can see that the z value is much less than the y and z measurements, which have nearly uniform distributions. Per the documentation for the diamonds dataset in RStudio, x is equal to the length, y is equal to the width, and z is equal to the depth. Should we not have that information, we would want to search the standard length, width, and depth for the specific diamonds (sorting by cuts). Certain diamonds have 1:1 length:width ratios, but when one is higher than the other, it is the width that increases, and depth would be less than both. Thus, since y > x > z, we’d be able to identify which variable was which.

Exercise 3

Explore the distribution of price. Do you discover anything unusual or surprising? Hint: Carefully think about the binwidth and make sure you try a wide range of values.

Code
diamonds %>%
ggplot(aes(x = color, y= price)) +
  geom_boxplot(aes(fill = cut))

Code
diamonds %>%
  filter(price < 4000) %>%
ggplot()+
  geom_histogram(mapping = aes(x = price ), binwidth = 20)

Code
diamonds %>%
  filter(price > 4000) %>%
ggplot()+
  geom_histogram(mapping = aes(x = price ), binwidth = 60)

Solution

With the standard deviation of price being nearly 4000, a bin width in that general area would seemingly allow for a neater visulation then one with a smaller bar width, but since we’re doing with such a wide range of values, filtering would be optimal. The greatest peak is apparent between $500 and $1000, with there surprisingly being no diamonds sold in the $1500 range. Among diamonds prices higher than the mean, most come between $4000 and $5000, but in general most diamonds sold fall under $1000, and it becomes even less common after the $2500 range. Thus, those priced beyond that certainly are of a coveted cut.

Exercise 4

How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

Code
diamonds %>%
  filter(carat == .99) %>%
  count()
# A tibble: 1 × 1
      n
  <int>
1    23
Code
diamonds %>%
  filter(carat == 1) %>%
  count()
# A tibble: 1 × 1
      n
  <int>
1  1558

Solution

You would assume that the artificial “cut-off” of a 1 carat diamond makes it in more demand. Demand increases price, which increases incentive, so if the buyer wants a 1-carat diamond, the producer is going to want to assure that the diamonds that are being made meet that threshold. As such, this large disparity is not surprising, and there may be a “rounding-up” of 0.99 carat diamonds happening here in a similar fashion to what one may do with their height.

Exercise 5

What is the major difference between using coord_cartesian() vs xlim() or ylim() to zoom in on a histogram/graphic?

Solution

If there are unusual values (outliers), coord_cartesian() will keep those values. On the other hand, xlim() and ylim() will simply throw away data outside the limits. Thus, if you want to see whether there are unusual values and where they are, coord_cartesian is the better argument to use.

Exercise 6

What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

Solution

Missing values in the histogram are excluded, but, in the bar chart, they show up as N/A. This is because histogram is for continious variables, with bar charts being applicable for categorical variables.

Exercise 7

What does na.rm = TRUE do in mean() and sum()? What happens when it is not included and NA values are present?

Solution

“na.rm = TRUE” removes N/A values from the dataset being used to calculate mean and sum. If not for that, the sum and mean values would show up as 0 for mean and N/A for sum.

Exercise 8

Use what you’ve learned in 7.5.1 A Categorical and continuous variable to improve this visualization of the departure times of cancelled vs. non-cancelled flights.

Code
flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(sched_dep_time)) + 
    geom_freqpoly(mapping = aes(color = cancelled), binwidth = 1/4)
Code
flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(x = sched_dep_time, y = after_stat(density))) +
    geom_freqpoly(mapping = aes(color = cancelled), binwidth = 1/4)

Code
flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(x= cancelled, y = sched_dep_time)) + geom_boxplot()

Solution

There are two possible ways to improve this data visualization. One way would be to switch what is being measured on the y-axis from count to density, which is standardized so that the area under each frequency polygon is one. Furthermore, a box plot is often seen as a better visualization for a continious variable being compared to a categorical variable, as it’s very easy to see the distribution of the scheduled departure time for cancelled flights verus non-cancelled flights; cancelled flights generally had a later scheduled departure time, which is not surprising.

Exercise 9

What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships result in lower quality diamonds being more expensive?

Code
diamonds %>%
ggplot(aes(x= color, y = price)) + geom_boxplot()

Code
diamonds %>%
ggplot(mapping = aes(x= clarity, y = price)) + geom_boxplot()

Code
diamonds %>%
ggplot(mapping = aes(x= cut, y = price)) + geom_boxplot()

Code
diamonds %>%
ggplot(mapping = aes(x= carat, y = price)) + geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

Code
diamonds %>%
ggplot(mapping = aes(x= cut, y = carat)) + geom_boxplot()

Solution

Carat is the variable most important for predicting the price of a diamond, with a clear positive relationship. There isn’t much of a clear relationship between cut and carat, with cuts deemed as “fair” actually having the most carats. Thus, as you’d expect, there’s also not much a relationship between price and cut, and since “fair” cuts seemed to have the most carats, those end up being the ones with the highest median price; the focus is clearly with how many carats are on the diamond.

Exercise 10

Recreate the horizontal boxplot below without using coord_flip().

Code
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  labs(
    x = NULL,
    y = NULL
  ) +
  coord_flip()
Code
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(y = reorder(class, hwy, FUN = median), x = hwy)) +
  labs(
    x = NULL,
    y = NULL
  )

Solution

Manually adjusting the x and y axis accomplishes the same goal as coord_flip().

Exercise 11

One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values.” One approach to remedy this problem is the letter value plot. Install the lvplot package and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots? Check out this paper

Code
ggplot(diamonds) + 
  geom_lv(mapping = aes(x = cut, y = price))

Solution

Lv plots are more useful for larger datasets because they provide more than a five-number summary, have more accuracy, and deal with outliers much better; larger datasets are going to be more likely to have more unusual values.The distribution of price for the fair cut is condensed much lower than that of the premium cut, for instance, which would be expected based on the general relationship between cut and price.

Exercise 12

Display the relationship between price and cut using geom_violin() and again using a faceted version of geom_histogram(). What are the pros and cons of each method?

Code
diamonds %>%
ggplot(aes(x = cut, y= price)) + 
  geom_violin(trim = FALSE)

Code
diamonds %>%
ggplot(aes(x = price)) + 
  geom_histogram() +
  facet_wrap(~ cut, ncol = 1, scales = "free_y")

Solution

The violin plot and faceted version of the histogram make it very easy to see the shape of the distribution between cut and price. However, one issue with the faceted histogram is that the scales of price differ greatly based on the cut, which can make it very difficult to understand at first glance. The violin plot, meanwhile, can struggle to demonstrate the comparison of the different cuts, as it’s more difficult to understand the median price of each cut. For shape, these well, but they have notable flaws for comparison purposes.

Exercise 13

Can you rescale the data used in the plot below to more clearly show the distribution of cut within color, or color within cut? Starting from the code below, use rescaling to show the distribution of (a) cut within color and (b) color within cut.

Code
diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))
Code
#joint distribution
diamonds %>% 
  count(color, cut) %>%  
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = prop))

Code
#cut within color
diamonds %>% 
  count(color, cut) %>%  
  group_by(color) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = prop)) + 
    ggtitle("Dist of cut within color")

Code
#color within cut
diamonds %>% 
  count(color, cut) %>%  
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = prop)) +
    ggtitle("Dist of color within cut")

Solution

The first plot is a joint distribution between cut and color, with the latter two showing the distribution of cut within color and color within cut.

Exercise 14

In Exercise 13, why is it slightly better to use aes(x = color, y = cut) instead of aes(x = cut, y = color)?

Solution

Color has more variables, and it’s easier to read a plot that is wider as opposed to narrower. Additionally, with color on x-axis, it is easier to see that J colors are less common. Also, the top colors are at the top, which is generally more visually easy to understand, since the count is mainly correlated with cut in the joint distribution.

Exercise 15

Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot hard to read? Improve the plot.

Code
flights %>%
  group_by(month, dest) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = factor(month), y = dest)) +
    geom_tile(aes(fill = avg_dep_delay)) +
  labs( x= "Month", y = "Destination", fill = "Average Departure Delay") +
    ggtitle("Average departure delay by month and destination")

Code
flights %>%
  group_by(dest) %>%
  filter(n() > 100) %>%
  ungroup() %>%
  group_by(month, dest) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = factor(month), y = dest)) +
    geom_tile(aes(fill = avg_dep_delay)) +
  labs( x= "Month", y = "Destination", fill = "Average Departure Delay") +
    ggtitle("Average departure delay by month and destination")

Code
flights %>%
  group_by(dest) %>%
  filter(n() > 100) %>%
  ungroup() %>%
  group_by(month, dest) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(y = factor(month), x = dest)) +
    geom_tile(aes(fill = avg_dep_delay)) +
  labs( x= "Destination", y = "Month", fill = "Average Departure Delay") +
    ggtitle("Average departure delay by month and destination")

Code
flights %>%
  group_by(dest) %>%
  filter(n() > 100) %>%
  ungroup() %>%
  group_by(month, dest) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ungroup() %>%
filter(dest %in% c("ORD", "ABQ", "HOU", "SFO")) %>% 
  ggplot(aes(y = factor(month), x = dest)) +
    geom_tile(aes(fill = avg_dep_delay)) +
  labs( x= "Destination", y = "Month", fill = "Average Departure Delay") +
    ggtitle("Average departure delay by month and destination")

Solution

Since there are so many different destinations, the geom_tile plot is very difficult to read. To improve this, we can try to filter out destinations + month combinations that are infrequent combinations, while also filtering to the specific destinations were are interested in.

Exercise 16

Use the smaller dataset defined below for this exercise. Visualize the distribution of carat broken down by price. Construct 2 graphics, one using cut_width() and the other using cut_number().

Code
smaller <- diamonds %>% 
  filter(carat < 3)
Code
smaller <- diamonds %>% 
  filter(carat < 3)

ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

Code
smaller <- diamonds %>% 
  filter(carat < 3)

ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))

Exercise 17

How does the distribution of price differ for very large diamonds (carat \(\ge\) 3)? Is this result surprising? Why or why not?

Code
larger <- diamonds %>% 
  filter(carat >= 3)

ggplot(data = larger, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, .5)))

Code
larger <- diamonds %>% 
  filter(carat >= 3)

ggplot(data = larger, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_number(carat, 5)))

Solution

The distribution for larger diamonds has more variability, which isn’t surprising since the sample size is much smaller. Generally, you want to look at data with a large enough sample size to decrease the overall statistical noise and make the information more reliable, as splits like this can be slightly deceiving. That being said, since larger diamonds have generally are of greater prices, there is also going to be a higher deviation between two diamonds in the larger range. These two factors leads to a much less stable data distribution.

Exercise 18

Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.

Code
diamonds %>%
  group_by(carat) %>%
  filter(n() > 10) %>%
  ungroup() %>%
  group_by(cut, carat) %>%
  summarise(avg_price = mean(price, na.rm = TRUE)) %>%
  ggplot(aes(y = cut, x = carat)) +
    geom_tile(aes(fill = avg_price)) +
  labs(x= "Carat", y = "Cut", fill = "Average Price") +
    ggtitle("Average price by cut and carat")

Code
diamonds %>%
ggplot(aes(x= carat, y = price, fill = cut)) + geom_boxplot()