The purpose of this short code-through is to provide a simple and
friendly tour of the ‘diamonds’ dataset in R.
The dataset comes from the ggplot2 package and contains
information on nearly 54,000 diamonds, including their price, weight
(carat), color, clarity, and cut.
I chose this dataset because:
- it is easy to load and explore,
- it contains both numeric and categorical variables,
- it is ideal for visualization,
- and it is a fun dataset for beginners who want to practice data analysis.
In this tutorial, we will load the diamonds dataset, examine its structure, summarize key variables, and create several visualizations.
By the end, you will understand how to explore any dataset in R using the same steps.
## Load Required Packages
library(ggplot2) # contains the diamonds dataset
library(dplyr) # helpful for data exploration
## Load the diamonds dataset
data("diamonds")
## Preview the first few rows
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
# Understanding the diamonds dataset
In this step, we loaded the diamonds dataset, which is
built into the ggplot2 package.
Because it is included with the package, we do not need to download or
import any files; we simply call data("diamonds") to make it
available in our R session.
After loading the dataset, we used a few basic R functions to understand what it contains:
- head(diamonds) shows the first six rows of the data so we can see actual examples.
- str(diamonds) displays the structure of the dataset, including each variable's name and type (e.g., numeric, ordered factor).
- summary(diamonds) provides summary statistics such as the minimum, maximum, median, mean, and quartiles for each numeric variable, and counts per level for each factor.
These commands help us quickly get familiar with the dataset before doing any analysis or visualization. This kind of initial exploration is an essential first step whenever you work with a new dataset.
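As a complement to head(), str(), and summary(), a few more base R calls can answer the same questions one at a time. The following is a minimal sketch, not part of the original walkthrough; the helper name `first_class` is hypothetical and exists only for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

data("diamonds")

# Number of rows and columns
dims <- dim(diamonds)
dims

# First class of each column (ordered factors report "ordered")
first_class <- function(col) class(col)[1]  # hypothetical helper
col_types <- vapply(diamonds, first_class, character(1))
col_types

# Levels of an ordered factor, listed from lowest to highest
cut_levels <- levels(diamonds$cut)
cut_levels
```

Checking dimensions and column types separately can be easier to read than the full str() output when a dataset has many columns.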
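The diamonds dataset contains 10 variables describing
different characteristics of diamonds. The most important ones for this
tutorial are price, carat (weight), cut, color, and clarity; the remaining
variables (depth, table, x, y, and z) describe each stone's proportions and dimensions.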
Before creating any visualizations, it is useful to look more closely at a few of these variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2000 0.4000 0.7000 0.7979 1.0400 5.0100
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18823
##
## Fair Good Very Good Premium Ideal
## 1610 4906 12082 13791 21551
##
## D E F G H I J
## 6775 9797 9542 11292 8304 5422 2808
##
## I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
## 741 9194 13065 12258 8171 5066 3655 1790
These summaries give us a quick idea of how the dataset is distributed. For example:
- carat tells us the minimum and maximum size of the diamonds.
- price shows the range of prices, which helps us understand how spread out the data is.
- cut, color, and clarity are categorical variables, so we use table() to see how many diamonds fall into each category.
This kind of variable-level exploration helps us think about useful visualizations we might create next.
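The variable-level summaries above were most likely produced by calls along the following lines. The exact code does not appear in the output, so treat this as a reconstruction under that assumption; the object names are chosen here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

data("diamonds")

# Numeric variables: minimum, quartiles, median, mean, and maximum
carat_summary <- summary(diamonds$carat)
price_summary <- summary(diamonds$price)

# Categorical variables: number of diamonds in each category
cut_counts     <- table(diamonds$cut)
color_counts   <- table(diamonds$color)
clarity_counts <- table(diamonds$clarity)

carat_summary
price_summary
cut_counts
color_counts
clarity_counts
```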
Now that we have explored the structure and summary of the dataset,
we can start creating simple visualizations using the
ggplot2 package. These plots help us understand patterns in
the data and make the dataset more meaningful.
## Histogram of Diamond Carat (Weight)
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "skyblue", color = "white") +
  labs(
    title = "Distribution of Diamond Carat Values",
    x = "Carat (Weight)",
    y = "Count of Diamonds"
  ) +
  theme_minimal()
The histogram shows the distribution of diamond weights
(carat). Most diamonds are relatively small (below 1
carat), and the number of diamonds generally decreases as carat size
increases. This makes sense because larger diamonds are rarer and
more expensive.
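The impression from the histogram can be checked numerically. This is a small sketch, assuming 1-carat-wide bands are a reasonable way to bin the weights; the band boundaries and object names are illustrative choices, not part of the original analysis.

```r
library(ggplot2)  # provides the diamonds dataset

data("diamonds")

# Proportion of diamonds lighter than 1 carat
share_under_1 <- mean(diamonds$carat < 1)
round(share_under_1, 3)

# Counts in 1-carat-wide bands: [0,1), [1,2), ..., [5,6)
carat_bands <- table(cut(diamonds$carat, breaks = 0:6, right = FALSE))
carat_bands
```

Here cut() is the base R binning function, which is unrelated to the cut column of the dataset.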
## Density Plot: Price Distribution by Diamond Color
ggplot(diamonds, aes(x = price, fill = color)) +
  geom_density(alpha = 0.6) +
  scale_fill_brewer(palette = "Spectral") +
  labs(
    title = "Price Density by Diamond Color",
    x = "Price (USD)",
    y = "Density",
    fill = "Color Grade"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))
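This density plot compares price distributions across diamond color
grades. Every grade is strongly right-skewed, with most diamonds at the
low end of the price range. Perhaps surprisingly, diamonds with the best
color grades (closer to D) do not have higher prices overall; the poorer
grades (closer to J) tend to be priced higher, largely because those
diamonds also tend to be larger. The overlapping curves show how the
price range shifts with color.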
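Overlapping density curves can be hard to compare by eye, so a grouped summary may help. Color grade and carat are not independent in this dataset, so the sketch below reports mean carat alongside median price for each grade. The column names `median_price` and `mean_carat` are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

data("diamonds")

# One row per color grade, with size and price side by side
price_by_color <- diamonds %>%
  group_by(color) %>%
  summarise(
    n = n(),
    median_price = median(price),
    mean_carat = mean(carat),
    .groups = "drop"
  )

price_by_color
```

Reading the two columns together makes it easier to judge how much of a price difference between grades may simply reflect a difference in size.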
## Faceted Scatterplot: Price vs Carat by Clarity
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.4) +
  scale_color_brewer(palette = "Dark2") +
  facet_wrap(~ clarity, ncol = 4) +
  labs(
    title = "Price vs Carat Faceted by Clarity",
    x = "Carat",
    y = "Price (USD)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))
This faceted scatterplot creates separate panels for each clarity category. It makes it easier to compare how price and carat vary across different kinds of diamonds.
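To put a number on what each panel shows, one option is the correlation between carat and price within each clarity grade. This is a rough sketch: Pearson correlation assumes a roughly linear relationship, which is only an approximation here, and the object name is illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

data("diamonds")

# Strength of the carat-price relationship within each clarity grade
price_carat_by_clarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(
    n = n(),
    correlation = cor(carat, price),
    .groups = "drop"
  )

price_carat_by_clarity
```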
## Heatmap of Median Price by Cut and Color
diamonds %>%
  group_by(cut, color) %>%
  summarise(median_price = median(price), .groups = "drop") %>%
  ggplot(aes(x = color, y = cut, fill = median_price)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(option = "plasma") +
  labs(
    title = "Median Diamond Price by Cut and Color",
    x = "Color Grade",
    y = "Cut Quality",
    fill = "Median Price"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))
This heatmap shows how median diamond price changes depending on both
cut and color. With the plasma palette, brighter (yellow) tiles indicate
higher median prices and darker (purple) tiles indicate lower ones. The
pattern helps reveal the combined effect of these two characteristics.
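The values behind the heatmap can also be viewed as a plain table. The sketch below uses base R's tapply() to build a cut-by-color matrix of median prices; it is an alternative view added for illustration, not the code used for the plot.

```r
library(ggplot2)  # provides the diamonds dataset

data("diamonds")

# Rows are cut levels, columns are color grades, cells are median prices
median_price_matrix <- tapply(
  diamonds$price,
  list(cut = diamonds$cut, color = diamonds$color),
  median
)

median_price_matrix
```

Each cell in this matrix corresponds to one tile in the heatmap, which makes it easy to look up an exact value.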
The diamonds dataset is useful for a wide range of
learning and analysis tasks. Because it contains both numeric and
categorical variables and a large number of observations, it is a
popular choice for practicing data visualization and statistical
modeling. Here are a few examples of how this dataset might be used:
Since the dataset includes price along with characteristics like weight, clarity, and cut, it can be used to build simple regression or machine learning models. A model could help estimate a fair price for a diamond based on its features.
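As a minimal sketch of such a model, the code below fits a linear regression of log price on log carat with base R's lm(). The choice of a log-log form and of carat as the only predictor are assumptions made for brevity, and the 1-carat diamond is a hypothetical example.

```r
library(ggplot2)  # provides the diamonds dataset

data("diamonds")

# Log-log regression: price is modeled as a power function of carat
fit <- lm(log(price) ~ log(carat), data = diamonds)
coef(fit)

# Predict for a hypothetical 1-carat diamond.
# exp() back-transforms from the log scale, so this is a rough estimate
# of a typical price rather than an exact mean.
new_diamond <- data.frame(carat = 1)
predicted_price <- exp(predict(fit, newdata = new_diamond))
predicted_price
```

Adding cut, color, and clarity as predictors would be a natural next step, since they let the model separate the effect of quality from the effect of size.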
The dataset allows beginners to practice identifying patterns, such as:
- how carat size affects price,
- how cut quality relates to price,
- whether color or clarity has a stronger influence on value.
It is commonly used to demonstrate:
- histograms,
- scatterplots,
- boxplots,
- density plots,
- faceted plots.
The variety of variables makes it well suited to learning ggplot2.
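Boxplots are the one plot type in this list not shown earlier, so here is a short sketch of price by cut in the same style as the other plots. The title, labels, and object name are illustrative choices.

```r
library(ggplot2)  # provides the diamonds dataset and plotting functions

data("diamonds")

# One box per cut level, showing the median and spread of price
price_by_cut_plot <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot(fill = "skyblue") +
  labs(
    title = "Diamond Price by Cut",
    x = "Cut Quality",
    y = "Price (USD)"
  ) +
  theme_minimal()

price_by_cut_plot
```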
With 10 variables and nearly 54,000 rows, the dataset is ideal for showing:
- filtering,
- grouping,
- summarizing,
- reshaping data,
- basic dplyr workflows.
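A basic dplyr workflow of this kind might look like the sketch below, which chains filtering, grouping, summarizing, and sorting. The 1-carat threshold and the object name are arbitrary choices for illustration.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

data("diamonds")

large_by_cut <- diamonds %>%
  filter(carat >= 1) %>%            # keep diamonds of 1 carat or more
  group_by(cut) %>%                 # one group per cut level
  summarise(
    n = n(),                        # diamonds in each group
    mean_price = mean(price),       # average price in each group
    .groups = "drop"
  ) %>%
  arrange(desc(mean_price))         # most expensive group first

large_by_cut
```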
Overall, the diamonds dataset is an excellent tool for
teaching fundamental concepts in data science, and the skills learned
here can be applied to many other datasets.
In this code-through, we explored the diamonds dataset,
examined its structure, summarized key variables, and created several
visualizations to understand how diamond characteristics relate to
price. The dataset offers a rich and accessible way to practice data
exploration in R, especially with tools like dplyr and
ggplot2.
These same steps (loading data, examining structure, summarizing variables, and visualizing relationships) can be applied to almost any dataset. By understanding this workflow, beginners gain a solid foundation for future data science work.