Diamonds Exercise

In this exercise, we are going to dive a little deeper into wrangling and exploring the diamonds dataset using dplyr and ggplot.

General housekeeping items

Let’s begin by opening libraries and clearing the environment:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

rm(list=ls())

Loading files

In an earlier exercise, we explored the “diamonds” dataset from the tidyverse. Let’s store the diamonds dataset in an object called “diamonds”. Information about this dataset can be found here.

diamonds <- diamonds

Wrangle and explore the dataset

Use dplyr to keep only diamonds smaller than 1.5 carats and store them in a data frame called “diamonds_small”:

diamonds_small <- diamonds %>%
  filter(carat < 1.5)

Report summary statistics for the new dataset:

summary(diamonds_small)

     carat              cut        color        clarity          depth      
 Min.   :0.200   Fair     : 1285   D: 6452   SI1    :11415   Min.   :43.00  
 1st Qu.:0.370   Good     : 4265   E: 9268   VS2    :10916   1st Qu.:61.10  
 Median :0.580   Very Good:10708   F: 8864   VS1    : 7435   Median :61.80  
 Mean   :0.671   Premium  :11552   G:10232   SI2    : 7230   Mean   :61.74  
 3rd Qu.:1.000   Ideal    :19895   H: 6916   VVS2   : 4861   3rd Qu.:62.50  
 Max.   :1.490                     I: 4045   VVS1   : 3587   Max.   :79.00  
                                   J: 1928   (Other): 2261                  
     table           price             x               y         
 Min.   :43.00   Min.   :  326   Min.   :0.000   Min.   : 0.000  
 1st Qu.:56.00   1st Qu.:  880   1st Qu.:4.630   1st Qu.: 4.640  
 Median :57.00   Median : 1951   Median :5.380   Median : 5.380  
 Mean   :57.37   Mean   : 2844   Mean   :5.471   Mean   : 5.476  
 3rd Qu.:59.00   3rd Qu.: 4258   3rd Qu.:6.340   3rd Qu.: 6.340  
 Max.   :79.00   Max.   :18700   Max.   :7.730   Max.   :31.800  
                                                                 
       z         
 Min.   : 0.000  
 1st Qu.: 2.840  
 Median : 3.310  
 Mean   : 3.379  
 3rd Qu.: 3.920  
 Max.   :31.800

Explore the data with simple visualizations and summary statistics

Graph the distribution of prices using a histogram from ggplot:

diamonds_small %>%
  ggplot(aes(x = price)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s try it again, this time using the natural logarithm of price:

diamonds_small %>%
  ggplot(aes(x = log(price))) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
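
The message above appears because no bin width was specified. As a minimal sketch, the two histograms can be redrawn with an explicit binwidth. The values 500 (dollars) and 0.1 (log-dollars), and the object names p_price and p_log_price, are illustrative assumptions rather than recommended settings, so it is worth trying several values:

```r
library(dplyr)
library(ggplot2)

# Rebuild the filtered data so this block runs on its own
diamonds_small <- diamonds %>%
  filter(carat < 1.5)

# Price histogram with an explicit bin width (500 is an assumed, illustrative value)
p_price <- diamonds_small %>%
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 500)

# Log-price histogram; the log scale needs a much narrower bin (0.1 is also assumed)
p_log_price <- diamonds_small %>%
  ggplot(aes(x = log(price))) +
  geom_histogram(binwidth = 0.1)

p_price
p_log_price
```

Setting binwidth explicitly silences the message and makes the choice of bins a deliberate part of the analysis.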

Practice summarizing data by groups with dplyr (determine average price for each level of clarity):

diamonds_small %>%
  group_by(clarity) %>%
  summarize(average_price = mean(price), 
            count_obs = n())

# A tibble: 8 × 3
  clarity average_price count_obs
  <ord>           <dbl>     <int>
1 I1              2515.       510
2 SI2             3181.      7230
3 SI1             2814.     11415
4 VS2             2812.     10916
5 VS1             2924.      7435
6 VVS2            2873.      4861
7 VVS1            2315.      3587
8 IF              2616.      1751

Combine dplyr and ggplot (i.e., “pipe” the summarized data into a ggplot):

diamonds_small %>%
  group_by(clarity) %>%
  summarize(average_price = mean(price), 
            count_obs = n()) %>% 
  ggplot(aes(x = clarity, y = average_price)) +
  geom_bar(stat = 'identity')

Thought exercise

Notice above that there doesn’t seem to be a clear relation between diamond clarity and price. That is, diamonds of higher clarity have similar or even lower prices (on average) than diamonds of lower clarity. Why do you think that is? Let’s explore below.

Investigate the relations between multiple variables using scatter plots from ggplot:

ggplot(diamonds_small, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.3)

More practice with dplyr (isolate and plot diamonds of similar size):

diamonds_small %>%
  filter(carat >= 1 & carat <= 1.1) %>%
  group_by(clarity) %>%
  summarize(average_price = mean(price), count_obs = n())

# A tibble: 8 × 3
  clarity average_price count_obs
  <ord>           <dbl>     <int>
1 I1              2929.       151
2 SI2             4231.      2017
3 SI1             4952.      2162
4 VS2             6058.      1578
5 VS1             6717.       866
6 VVS2            8513.       491
7 VVS1            9077.       174
8 IF             11176.       129

diamonds_small %>%
  filter(carat >= 1 & carat <= 1.1) %>%
  group_by(clarity) %>%
  summarize(average_price = mean(price), count_obs = n()) %>%
  ggplot(aes(x = clarity, y = average_price)) +
  geom_bar(stat = 'identity')
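
As an aside, the range filter used above can also be written with dplyr’s between() helper, which includes both endpoints. The sketch below compares the two forms; the object names similar_size and original_filter are hypothetical and used only for this comparison:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Range filter written with between(); both endpoints are included
similar_size <- diamonds %>%
  filter(carat < 1.5) %>%
  filter(between(carat, 1, 1.1))

# The same filter written with explicit comparisons, as in the exercise
original_filter <- diamonds %>%
  filter(carat < 1.5) %>%
  filter(carat >= 1 & carat <= 1.1)

# Check that both approaches keep the same number of rows
nrow(similar_size) == nrow(original_filter)
```

Either form works; between() is simply a little easier to read when both bounds are inclusive.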

Thought exercise - follow up

Why did we observe that diamonds of higher clarity have similar or even lower prices (on average) than diamonds of lower clarity in the full dataset? What do the analyses above tell you about the relations between size, clarity, and price?
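
One possible way to start probing these questions is to summarize diamond size alongside price for each clarity level. This is only a sketch of one approach; the name carat_by_clarity and the column average_carat are assumptions introduced here and do not appear earlier in the exercise:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Rebuild the filtered data so this block runs on its own
diamonds_small <- diamonds %>%
  filter(carat < 1.5)

# Average size and average price for each clarity level
carat_by_clarity <- diamonds_small %>%
  group_by(clarity) %>%
  summarize(average_carat = mean(carat),
            average_price = mean(price),
            count_obs = n())

carat_by_clarity
```

Comparing the average_carat column across clarity levels with the average_price column may help explain the pattern seen in the full dataset.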