1 Introduction

The tidyverse is a collection of packages that work in harmony. We will use the ggplot2 package, which is part of the tidyverse, to produce our data visualizations. As we move forward, we will also use some of the other packages loaded via tidyverse.

library(tidyverse)

The ggplot2 package comes with a data set called diamonds. Let’s look at it below. For further details, type ?diamonds in your console.

glimpse(diamonds)
Observations: 53,940
Variables: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
$ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
$ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
$ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...

2 Data frame review

The diamonds data set is stored in R as a tibble. A tibble prints compactly in the console: only the first ten rows and the columns that fit on screen are shown, along with each column's type. Type diamonds in your console to see.
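To check the tibble class before printing, a minimal sketch along these lines should work. It assumes library(tidyverse) has been run as above; is_tibble() comes from the tibble package, which tidyverse attaches.

```r
library(tidyverse)

# A tibble is a data frame with extra classes layered on top
class(diamonds)

# TRUE when the object is a tibble
is_tibble(diamonds)

# Number of rows and columns, matching the glimpse() header
dim(diamonds)
```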

Let’s start with something we want to investigate. What is the relationship between a diamond’s price and its quality? Quality can be thought of as a function of carat, cut, color, clarity, and the numeric measurements.

Let’s compare the prices of Fair and Ideal diamonds.

summary(diamonds$price[diamonds$cut == "Fair"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    337    2050    3282    4359    5206   18574 
summary(diamonds$price[diamonds$cut == "Ideal"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    326     878    1810    3458    4678   18806 

Interesting. Every summary statistic except the maximum is higher for diamonds rated Fair than for those rated Ideal. Continue on below to investigate further.
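The same comparison can be made for every cut level at once with dplyr, one of the packages loaded via tidyverse. The following is a sketch of one possible approach, not the only one; the names price_by_cut, median_price, and mean_price are chosen here for illustration.

```r
library(tidyverse)

# One row per cut level, with the median and mean price in each group
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarize(
    median_price = median(price),
    mean_price   = mean(price)
  )

price_by_cut
```

The Fair and Ideal rows should agree with the summary() output above, and the remaining rows show where the other cut levels fall.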

2.1 Exercises

  1. Variables color and clarity are both ordered factors. Use the function levels to see the levels of each variable. The clarity levels are sorted from worst (I1) to best (IF); the color levels run from D (best) to J (worst).

  2. Compare prices of the worst and best diamonds in terms of color and clarity. What do you notice?

  3. Use the function table to see how many diamonds there are for each cut level.

  4. Add a new variable to diamonds called price.per.carat that represents the price per carat.

  5. Hypothesize as to why lower-quality diamonds may be more expensive.

3 Visualizations with ggplot

3.1 Exercises

Recreate each plot below. Comment on any interesting trends/relationships you observe.
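As a starting point, most ggplot2 calls share the same shape: a data frame, an aesthetic mapping, and one or more geom layers. The sketch below only illustrates that pattern; the choice of carat against price and the alpha value are assumptions made here, and it is not necessarily one of the plots to recreate.

```r
library(tidyverse)

# Data + aesthetic mapping + a geom layer
p <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)  # transparency reduces overplotting with ~54,000 points

p
```

Swapping the geom, or mapping a variable such as cut to color or to a facet, changes the plot while the data and the rest of the call stay the same.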

Plot 1

Plot 2

Plot 3

Plot 4

Plot 5

4 References

  1. Grolemund, G., & Wickham, H. (2019). R for Data Science. https://r4ds.had.co.nz/