We want to explore the diamonds dataset graphically to get some ideas about what makes a diamond valuable.

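We will work with a 10% random sample so that each plot and summary runs without long pauses. Because the sample is random, the exact numbers shown below will differ from run to run.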

First, load the tidyverse.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

Take the sample, add a price-per-carat column (ppc), keep only the columns of interest, and examine the result.

lild <- diamonds %>%
  sample_frac(size = 0.1) %>%
  mutate(ppc = price / carat) %>%
  select(price, carat, ppc, cut, color, clarity)
glimpse(lild)
## Observations: 5,394
## Variables: 6
## $ price   <int> 1258, 644, 566, 14103, 2579, 1183, 2683, 3321, 12811, ...
## $ carat   <dbl> 0.56, 0.35, 0.27, 1.54, 0.76, 0.43, 0.75, 0.52, 2.29, ...
## $ ppc     <dbl> 2246.429, 1840.000, 2096.296, 9157.792, 3393.421, 2751...
## $ cut     <ord> Good, Premium, Ideal, Ideal, Very Good, Ideal, Premium...
## $ color   <ord> D, D, H, G, G, G, F, D, J, F, H, D, F, E, F, G, H, D, ...
## $ clarity <ord> SI2, SI1, IF, VS2, SI1, VVS1, SI1, VVS1, SI1, VS2, SI1...
summary(lild)
##      price             carat             ppc               cut      
##  Min.   :  326.0   Min.   :0.2100   Min.   : 1139   Fair     : 141  
##  1st Qu.:  980.2   1st Qu.:0.4000   1st Qu.: 2534   Good     : 506  
##  Median : 2479.0   Median :0.7100   Median : 3514   Very Good:1202  
##  Mean   : 3953.9   Mean   :0.8025   Mean   : 4034   Premium  :1360  
##  3rd Qu.: 5292.0   3rd Qu.:1.0400   3rd Qu.: 4980   Ideal    :2185  
##  Max.   :18818.0   Max.   :4.0100   Max.   :16726                   
##                                                                     
##  color       clarity    
##  D: 687   SI1    :1340  
##  E: 996   VS2    :1197  
##  F: 947   SI2    : 966  
##  G:1132   VS1    : 760  
##  H: 822   VVS2   : 510  
##  I: 542   VVS1   : 371  
##  J: 268   (Other): 250
table(lild$clarity)
## 
##   I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF 
##   77  966 1340 1197  760  510  371  173
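Since sample_frac() draws rows at random, the counts above cannot be reproduced exactly. A minimal sketch of one way to make the sample repeatable is to call set.seed() immediately before sampling. The seed value 2017 and the names lild_a and lild_b are arbitrary choices for illustration, not part of the original analysis.

```r
library(tidyverse)

# Assumption: 2017 is an arbitrary seed chosen for this sketch.
set.seed(2017)
lild_a <- diamonds %>% sample_frac(size = 0.1)

# Resetting the same seed before sampling again gives the same rows.
set.seed(2017)
lild_b <- diamonds %>% sample_frac(size = 0.1)

identical(lild_a, lild_b)  # TRUE when both draws used the same seed
nrow(lild_a)               # 10% of the 53,940 rows in diamonds
```

With a fixed seed, anyone rerunning the code on the same version of R and dplyr should see the same sample and therefore the same tables.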

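How do you examine the relationship between a categorical explanatory variable (here, cut) and a quantitative response (here, ppc)? Two common choices are side-by-side boxplots, which summarize each group, and a jittered scatterplot, which shows every observation.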

lild %>% ggplot(aes(x = cut, y = ppc)) + geom_boxplot()

lild %>% ggplot(aes(x = cut, y = ppc)) + geom_jitter()
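The two plots answer slightly different questions, so it can help to look at them together and to back them up with numbers. The sketch below is one possible way to do that, not the only one: it computes the median and interquartile range of ppc for each cut, then draws the boxplots on top of the jittered points. The seed, the name ppc_by_cut, and the width, alpha, and colour settings are assumptions made for illustration.

```r
library(tidyverse)

# Rebuild the sample so this block runs on its own.
# Assumption: the seed 2017 is arbitrary.
set.seed(2017)
lild <- diamonds %>%
  sample_frac(size = 0.1) %>%
  mutate(ppc = price / carat)

# Numeric counterpart of the boxplot: one row per cut.
ppc_by_cut <- lild %>%
  group_by(cut) %>%
  summarise(n = n(), median_ppc = median(ppc), iqr_ppc = IQR(ppc))
ppc_by_cut

# Jittered points underneath, unfilled boxplots on top.
# outlier.shape = NA hides the boxplot's outlier marks, since the
# jitter layer already draws every observation.
p <- lild %>%
  ggplot(aes(x = cut, y = ppc)) +
  geom_jitter(width = 0.2, alpha = 0.2) +
  geom_boxplot(outlier.shape = NA, fill = NA, colour = "red")
p
```

Lowering alpha keeps dense regions readable, and the summary table gives exact values to compare against what the boxes suggest visually.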