In this exercise, we are going to dive a little deeper into wrangling and exploring the diamonds dataset using dplyr and ggplot2.
General housekeeping items
Let’s begin by loading the libraries and clearing the environment:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
rm(list = ls())
Loading files
In an earlier exercise, we explored the “diamonds” dataset from the tidyverse (it ships with ggplot2). Let’s store the diamonds dataset in an object called “diamonds”. Information about this dataset can be found in its help page (?diamonds).
diamonds <- diamonds
Wrangle and explore the dataset
Use dplyr to keep only diamonds smaller than 1.5 carats and store them in a data frame called “diamonds_small”:
diamonds_small <- diamonds %>% filter(carat < 1.5)
Report summary statistics for the new dataset:
summary(diamonds_small)
     carat            cut            color        clarity          depth
 Min.   :0.200   Fair     : 1285   D: 6452   SI1    :11415   Min.   :43.00
 1st Qu.:0.370   Good     : 4265   E: 9268   VS2    :10916   1st Qu.:61.10
 Median :0.580   Very Good:10708   F: 8864   VS1    : 7435   Median :61.80
 Mean   :0.671   Premium  :11552   G:10232   SI2    : 7230   Mean   :61.74
 3rd Qu.:1.000   Ideal    :19895   H: 6916   VVS2   : 4861   3rd Qu.:62.50
 Max.   :1.490                     I: 4045   VVS1   : 3587   Max.   :79.00
                                   J: 1928   (Other): 2261
     table           price             x               y
 Min.   :43.00   Min.   :  326   Min.   :0.000   Min.   : 0.000
 1st Qu.:56.00   1st Qu.:  880   1st Qu.:4.630   1st Qu.: 4.640
 Median :57.00   Median : 1951   Median :5.380   Median : 5.380
 Mean   :57.37   Mean   : 2844   Mean   :5.471   Mean   : 5.476
 3rd Qu.:59.00   3rd Qu.: 4258   3rd Qu.:6.340   3rd Qu.: 6.340
 Max.   :79.00   Max.   :18700   Max.   :7.730   Max.   :31.800
       z
 Min.   : 0.000
 1st Qu.: 2.840
 Median : 3.310
 Mean   : 3.379
 3rd Qu.: 3.920
 Max.   :31.800
Explore the data with simple visualizations and summary statistics
Graph the distribution of prices using a histogram from ggplot:
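The code for this plot is not shown in the text. A minimal sketch, assuming the “diamonds_small” data frame described above and an arbitrarily chosen bin width of 500 dollars (both the bin width and the object name price_hist are illustrative assumptions), might look like this:

```r
library(dplyr)
library(ggplot2)

# Recreate the subset used in this exercise (diamonds under 1.5 carats)
diamonds_small <- ggplot2::diamonds %>%
  filter(carat < 1.5)

# Histogram of price; binwidth = 500 is an illustrative choice, not from the original
price_hist <- ggplot(diamonds_small, aes(x = price)) +
  geom_histogram(binwidth = 500) +
  labs(x = "Price (US dollars)", y = "Number of diamonds")

price_hist
```

Trying a few different bin widths is worthwhile, since the apparent shape of a histogram can change with the binning.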
Notice above that there doesn’t seem to be a clear relationship between diamond clarity and price. That is, diamonds of higher clarity have similar or even lower prices (on average) than diamonds of lower clarity. Why do you think that is? Let’s explore below.
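One way to check the clarity and price comparison numerically is to summarise price within each clarity grade. This is a sketch under the same assumptions as the rest of the exercise (diamonds under 1.5 carats); the object name price_by_clarity and the choice of summary columns are illustrative, not taken from the original:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Recreate the subset used in this exercise
diamonds_small <- ggplot2::diamonds %>%
  filter(carat < 1.5)

# Count, average price, and average size within each clarity grade
price_by_clarity <- diamonds_small %>%
  group_by(clarity) %>%
  summarise(
    n = n(),
    mean_price = mean(price),
    median_price = median(price),
    mean_carat = mean(carat)
  )

price_by_clarity
```

Reading mean_carat alongside the price columns can help with the question posed above, because it shows whether the clarity grades differ in typical size.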
Investigate the relationships between multiple variables using scatter plots from ggplot2:
ggplot(diamonds_small, aes(x = carat, y = price, color = clarity)) + geom_point(alpha = 0.3)
More practice with dplyr (isolate and plot diamonds of similar size):
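The code for this step is not shown in the text. As a hedged sketch, one could hold size roughly constant by keeping only diamonds close to one carat and then comparing price across clarity grades. The 0.9 to 1.1 carat window, the box plot, and the names diamonds_similar and similar_plot are all assumptions made for illustration:

```r
library(dplyr)
library(ggplot2)

# Keep diamonds of roughly one carat; the 0.9 to 1.1 window is an illustrative assumption
diamonds_similar <- ggplot2::diamonds %>%
  filter(carat < 1.5) %>%
  filter(between(carat, 0.9, 1.1))

# Compare price across clarity grades for diamonds of similar size
similar_plot <- ggplot(diamonds_similar, aes(x = clarity, y = price)) +
  geom_boxplot() +
  labs(x = "Clarity", y = "Price (US dollars)")

similar_plot
```

Narrowing or shifting the carat window is a simple way to see whether the pattern holds at other sizes.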
Why did diamonds of higher clarity have similar or even lower prices (on average) than diamonds of lower clarity in the full dataset? What do the analyses above tell you about the relationships between size, clarity, and price?