Package tidyverse is a set of packages that work in harmony. We will use the ggplot2 package to produce our data visualizations. This package is part of the tidyverse package. As we move forward, we will utilize some of the other packages loaded via tidyverse.
library(tidyverse)
The ggplot2 package comes with a data set called diamonds. Let’s look at it below. To obtain further details, type ?diamonds in your console window.
glimpse(diamonds)
Observations: 53,940
Variables: 10
$ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
$ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
$ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
$ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
$ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
$ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
$ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
$ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
$ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...
Data set diamonds is stored in R as a tibble. This allows for a convenient way to view the data frame in the console. Type diamonds in your console to see.
Let’s start with something we want to investigate. What is the relationship between a diamond’s price and its quality? Quality can be thought of as a function of carat, cut, color, clarity, and the numeric measurements.
Let’s compare the prices of Fair and Ideal diamonds.
summary(diamonds$price[diamonds$cut == "Fair"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
337 2050 3282 4359 5206 18574
summary(diamonds$price[diamonds$cut == "Ideal"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
326 878 1810 3458 4678 18806
Interesting. Apart from the maximum, every summary statistic (minimum, first quartile, median, mean, and third quartile) is higher for diamonds rated Fair than for those rated Ideal. Work through the exercises below to investigate further.
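The two calls to summary above can be extended to every cut level at once. The following is a minimal sketch using group_by and summarise from dplyr (loaded via tidyverse); the name price_by_cut and the column names n, median_price, and mean_price are chosen here for illustration only.

```r
library(tidyverse)

# One row per cut level: number of diamonds, median price, mean price
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n = n(),
    median_price = median(price),
    mean_price = mean(price)
  )

price_by_cut
```

Keeping the count n next to the price summaries makes it easier to judge how much data sits behind each figure.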
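Variables color and clarity are both ordered factors. Use the function levels to see the levels of each variable. The levels of clarity run from worst (I1) to best (IF), while the levels of color run from best (D) to worst (J); see ?diamonds for details.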
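As a minimal sketch, the calls below print the levels of both variables in their stored order. The is.ordered checks are an extra step added here to confirm how R stores the two variables.

```r
library(ggplot2)  # provides the diamonds data set

# Check whether each variable is stored as an ordered factor
is.ordered(diamonds$color)
is.ordered(diamonds$clarity)

# levels() returns the level labels in their stored order
levels(diamonds$color)
levels(diamonds$clarity)
```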
Compare prices of the worst and best diamonds in terms of color and clarity. What do you notice?
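One way to set up this comparison is to reuse the subsetting pattern from the Fair versus Ideal example. The sketch below assumes the grading described in ?diamonds (color D is best and J is worst; clarity I1 is worst and IF is best); the object names are illustrative only.

```r
library(ggplot2)  # provides the diamonds data set

# Color: J is the worst grade, D the best (see ?diamonds)
price_color_worst <- summary(diamonds$price[diamonds$color == "J"])
price_color_best  <- summary(diamonds$price[diamonds$color == "D"])

# Clarity: I1 is the worst grade, IF the best (see ?diamonds)
price_clarity_worst <- summary(diamonds$price[diamonds$clarity == "I1"])
price_clarity_best  <- summary(diamonds$price[diamonds$clarity == "IF"])

price_color_worst
price_color_best
price_clarity_worst
price_clarity_best
```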
Use the function table to see how many diamonds there are for each cut level.
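A minimal sketch of the count is shown below; cut_counts is a name chosen here for illustration.

```r
library(ggplot2)  # provides the diamonds data set

# Number of diamonds at each cut level, in factor-level order
cut_counts <- table(diamonds$cut)
cut_counts
```

The counts add up to the 53,940 observations reported by glimpse.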
Add a new variable to diamonds called price.per.carat that represents the price per carat.
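One possible approach, sketched here with mutate from dplyr (loaded via tidyverse), is to divide price by carat. Base R assignment with $ would work equally well.

```r
library(tidyverse)

# Assigning to diamonds creates a modified copy in the workspace;
# the original data set inside the ggplot2 package is left unchanged.
diamonds <- diamonds %>%
  mutate(price.per.carat = price / carat)

glimpse(diamonds)
```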
Hypothesize as to why lower quality diamonds may be more expensive.
Recreate each plot below. Comment on any interesting trends/relationships you observe.