Introduction

The purpose of this short code-through is to provide a simple and friendly tour of the ‘diamonds’ dataset in R.
The dataset comes from the ggplot2 package and contains information on nearly 54,000 diamonds, including their price, weight (carat), color, clarity, and cut.

I chose this dataset because:

  - it is easy to load and explore,
  - it contains both numeric and categorical variables,
  - it is ideal for visualization,
  - and it is a fun dataset for beginners who want to practice data analysis.

In this tutorial, we will:

  1. Load the diamonds dataset
  2. Look at the variables it includes
  3. Explore some simple summary statistics
  4. Create basic visualizations using ggplot2

By the end, you will have a repeatable set of steps for taking a first look at a new dataset in R.

## Load Required Packages
library(ggplot2)   # contains the diamonds dataset
library(dplyr)     # helpful for data exploration

## Load the diamonds dataset
data("diamonds")

## Preview the first few rows
head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
## Structure of the dataset
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## Summary statistics
summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

Understanding the diamonds Dataset

In this step, we loaded the diamonds dataset, which is built into the ggplot2 package.
Because it is included with the package, we do not need to download or import any files; we simply call data("diamonds") to make it available in our R session. (Once ggplot2 is attached, diamonds can also be used directly, so the data() call is optional.)

After loading the dataset, we used a few basic R functions to understand what it contains:

  - head(diamonds) shows the first six rows of the data so we can see actual examples.
  - str(diamonds) displays the structure of the dataset, including each variable's name and type (e.g., numeric, ordered factor).
  - summary(diamonds) reports the minimum, quartiles, median, mean, and maximum of each numeric variable, and the number of diamonds at each level of the factor variables.

These commands help us quickly get familiar with the dataset before doing any analysis or visualization. This kind of initial exploration is an essential first step whenever you work with a new dataset.
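
As a small extension of the checks above, the sketch below adds three more first-look commands: the dimensions, a compact column overview, and a count of missing values. It also looks for rows where a dimension is recorded as 0, since the summary output shows a minimum of 0 for x, y, and z. It assumes dplyr is installed; the names missing_per_column and zero_dims are made up for this example.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)    # provides glimpse() and filter()

# Number of rows and columns
dim(diamonds)

# Compact overview: one line per variable, with its type and first values
glimpse(diamonds)

# Count of missing values in each column
missing_per_column <- colSums(is.na(diamonds))
missing_per_column

# Rows where at least one dimension is recorded as 0,
# which is not physically possible for a real diamond
zero_dims <- diamonds %>%
  filter(x == 0 | y == 0 | z == 0)
nrow(zero_dims)
```

Rows like the ones in zero_dims are worth inspecting before any modeling, because they probably reflect data-entry problems rather than real measurements.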

Exploring Key Variables

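The diamonds dataset contains 10 variables describing different characteristics of diamonds. Here are some of the most important ones:

  - carat: the weight of the diamond
  - cut: the quality of the cut, from Fair to Ideal
  - color: the color grade, from D (best) to J (worst)
  - clarity: how clear the diamond is, from I1 (worst) to IF (best)
  - price: the price in US dollars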

Before creating any visualizations, it is useful to look more closely at a few of these variables.

## Explore the distribution of key variables

summary(diamonds$carat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2000  0.4000  0.7000  0.7979  1.0400  5.0100
summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823
table(diamonds$cut)
## 
##      Fair      Good Very Good   Premium     Ideal 
##      1610      4906     12082     13791     21551
table(diamonds$color)
## 
##     D     E     F     G     H     I     J 
##  6775  9797  9542 11292  8304  5422  2808
table(diamonds$clarity)
## 
##    I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF 
##   741  9194 13065 12258  8171  5066  3655  1790

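These summaries give us a quick idea of how the dataset is distributed. For example:

  - The mean price (3,933) is well above the median (2,401), so prices are right-skewed: a relatively small number of very expensive diamonds pull the mean up.
  - Half of all diamonds weigh 0.7 carats or less, although the largest weighs 5.01 carats.
  - Ideal is the most common cut (21,551 diamonds) and Fair the least common (1,610).
  - G is the most common color grade, and SI1 is the most common clarity grade.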

This kind of variable-level exploration helps us think about useful visualizations we might create next.
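
Raw counts are easier to compare as proportions. The following is a minimal sketch, assuming dplyr is loaded; the object names cut_share and price_center are chosen only for illustration.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Share of diamonds in each cut category
cut_share <- diamonds %>%
  count(cut) %>%                 # one row per cut, with the count in column n
  mutate(share = n / sum(n))     # convert counts to proportions
cut_share

# A mean well above the median suggests a right-skewed distribution
price_center <- diamonds %>%
  summarise(
    mean_price   = mean(price),
    median_price = median(price)
  )
price_center
```

The same count() and mutate() pattern should work for color and clarity by swapping the column name.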

Visualizing the diamonds Dataset

Now that we have explored the structure and summary of the dataset, we can start creating simple visualizations using the ggplot2 package. These plots help us understand patterns in the data and make the dataset more meaningful.

## Histogram of Diamond Carat (Weight)

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "skyblue", color = "white") +
  labs(
    title = "Distribution of Diamond Carat Values",
    x = "Carat (Weight)",
    y = "Count of Diamonds"
  ) +
  theme_minimal()

The histogram shows the distribution of diamond weights (carat). Most diamonds are relatively small (below 1 carat), and the number of diamonds decreases as carat size increases. This makes sense because larger diamonds are rarer and more expensive.
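
To put a number on what the histogram suggests, one option is to compute the share of diamonds under 1 carat and list the most frequent carat values. This is only a sketch; share_under_one and top_carats are illustrative names.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Proportion of diamonds that weigh less than 1 carat
# (the mean of a logical vector is the share of TRUE values)
share_under_one <- mean(diamonds$carat < 1)
share_under_one

# The five most common carat values, most frequent first
top_carats <- diamonds %>%
  count(carat, sort = TRUE) %>%
  head(5)
top_carats
```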

## Density Plot: Price Distribution by Diamond Color

ggplot(diamonds, aes(x = price, fill = color)) +
  geom_density(alpha = 0.6) +
  scale_fill_brewer(palette = "Spectral") +
  labs(
    title = "Price Density by Diamond Color",
    x = "Price (USD)",
    y = "Density",
    fill = "Color Grade"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))

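This density plot compares price distributions across diamond color grades. The curves overlap heavily: every grade is concentrated at low prices with a long right tail. Perhaps surprisingly, the worse color grades (closer to J) carry more weight at high prices than the best grades (closer to D). This does not mean that poor color raises the price; in this dataset, diamonds with worse color grades tend to be larger, and carat has a strong effect on price.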
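
One way to check how color relates to price, without relying on the plot alone, is to compare medians by color grade and to look at carat alongside price. The sketch below assumes dplyr is loaded; by_color is an illustrative name.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Median weight and median price for each color grade (D is best, J is worst)
by_color <- diamonds %>%
  group_by(color) %>%
  summarise(
    n            = n(),
    median_carat = median(carat),
    median_price = median(price),
    .groups = "drop"
  )
by_color
```

If median carat and median price rise together toward the J end of the table, size is a likely explanation for the price differences between color grades. A fairer comparison would hold carat roughly constant, for example by comparing only diamonds of similar weight.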

## Faceted Scatterplot: Price vs Carat by Clarity

ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.4) +
  scale_color_brewer(palette = "Dark2") +
  facet_wrap(~ clarity, ncol = 4) +
  labs(
    title = "Price vs Carat Faceted by Clarity",
    x = "Carat",
    y = "Price (USD)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))

This faceted scatterplot creates a separate panel for each clarity grade, which makes it easier to compare how price rises with carat from one grade to the next.
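
A numeric companion to the panels is the correlation between carat and price within each clarity grade. This is a rough sketch only: Pearson correlation measures linear association, and the price and carat relationship is curved, so treat the values as a quick comparison rather than a full description. The name cor_by_clarity is illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Pearson correlation between carat and price, one value per clarity grade
cor_by_clarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(
    n               = n(),
    carat_price_cor = cor(carat, price),
    .groups = "drop"
  )
cor_by_clarity
```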

## Heatmap of Median Price by Cut and Color

diamonds %>%
  group_by(cut, color) %>%
  summarise(median_price = median(price), .groups = "drop") %>%  # drop grouping to avoid the regrouping message
  ggplot(aes(x = color, y = cut, fill = median_price)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(option = "plasma") +
  labs(
    title = "Median Diamond Price by Cut and Color",
    x = "Color Grade",
    y = "Cut Quality",
    fill = "Median Price"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))

This heatmap shows how median diamond price changes depending on both cut and color. With the plasma palette, brighter (yellow) tiles indicate higher median prices and darker (purple) tiles indicate lower ones. The pattern helps reveal the combined effect of these two characteristics.

How This Dataset Could Be Used

The diamonds dataset is useful for a wide range of learning and analysis tasks. Because it contains both numeric and categorical variables and a large number of observations, it is a popular choice for practicing data visualization and statistical modeling. Here are a few examples of how this dataset might be used:

1. Predicting Diamond Prices

Since the dataset includes price along with characteristics like weight, clarity, and cut, it can be used to build simple regression or machine learning models. A model could help estimate a fair price for a diamond based on its features.
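
As a minimal sketch of such a model, the code below fits a linear regression of log price on log carat using base R's lm(). The log-log form is an assumption made for this example, not something the tutorial prescribes, and the names fit, new_diamond, and predicted_price are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Simple linear regression on the log scale: log(price) ~ log(carat)
fit <- lm(log(price) ~ log(carat), data = diamonds)

# Intercept and slope, and the share of variance explained
coef(fit)
summary(fit)$r.squared

# Rough price estimate for a hypothetical 1-carat diamond;
# exp() converts the prediction from log dollars back to dollars
new_diamond <- data.frame(carat = 1)
predicted_price <- exp(predict(fit, newdata = new_diamond))
predicted_price
```

Adding cut, color, and clarity to the formula (for example log(price) ~ log(carat) + cut + color + clarity) would be a natural next step, since carat alone ignores the quality grades.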

2. Exploring Relationships Between Variables

The dataset allows beginners to practice identifying patterns, such as:

  - how carat size affects price,
  - how cut quality relates to price,
  - whether color or clarity has a stronger influence on value.

3. Practicing Data Visualization Techniques

It is commonly used to demonstrate:

  - histograms,
  - scatterplots,
  - boxplots,
  - density plots,
  - faceted plots.

The variety of variables makes it perfect for learning ggplot2.
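
Boxplots are the one plot type in this list that the tutorial has not shown, so here is a short sketch of price by cut. The log scale on the y axis is a choice made for this example, because prices are strongly right-skewed; price_by_cut_plot is an illustrative name.

```r
library(ggplot2)  # provides the diamonds dataset and plotting functions

# Boxplot of price for each cut category, with price on a log10 scale
price_by_cut_plot <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot(fill = "skyblue") +
  scale_y_log10() +
  labs(
    title = "Diamond Price by Cut",
    x = "Cut Quality",
    y = "Price (USD, log scale)"
  ) +
  theme_minimal()

price_by_cut_plot
```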

4. Teaching Data Cleaning and Wrangling

With 10 variables and thousands of rows, the dataset is ideal for showing:

  - filtering,
  - grouping,
  - summarizing,
  - reshaping data,
  - basic dplyr workflows.
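
A basic dplyr workflow along these lines might look like the sketch below, which filters, creates a new column, groups, summarizes, and sorts. Dropping rows with a zero dimension and the price_per_carat column are both choices made for this example; clean_diamonds and price_per_carat_by_cut are illustrative names.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Filtering: keep only rows where all three dimensions are positive
clean_diamonds <- diamonds %>%
  filter(x > 0, y > 0, z > 0)

# Mutating, grouping, summarizing, and sorting in one pipeline
price_per_carat_by_cut <- clean_diamonds %>%
  mutate(price_per_carat = price / carat) %>%
  group_by(cut) %>%
  summarise(
    n                      = n(),
    median_price_per_carat = median(price_per_carat),
    .groups = "drop"
  ) %>%
  arrange(desc(median_price_per_carat))

price_per_carat_by_cut
```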

Overall, the diamonds dataset is an excellent tool for teaching fundamental concepts in data science, and the skills learned here can be applied to many other datasets.

Conclusion

In this code-through, we explored the diamonds dataset, examined its structure, summarized key variables, and created several visualizations to understand how diamond characteristics relate to price. The dataset offers a rich and accessible way to practice data exploration in R, especially with tools like dplyr and ggplot2.

These same steps (loading data, examining structure, summarizing variables, and visualizing relationships) can be applied to almost any dataset. By understanding this workflow, beginners gain a solid foundation for future data science work.