Dataset Overview and Source

Diamonds Dataset

This analysis explores the pricing and physical characteristics of the 53,940 diamonds in the dataset to identify the key drivers of diamond price. The figures and models below use a random sample of 3,000 records.

Data Source: Built-in R dataset (ggplot2 package) — originally compiled from AwesomeGems.com

Key Variables:

  • Carat: Diamond weight (0.20 – 5.01)
  • Cut: Quality of the cut (Fair → Good → Very Good → Premium → Ideal)
  • Color: Color grading from D (best) to J (worst)
  • Clarity: Clarity grading from I1 (worst) to IF (best)
  • Depth: Total depth percentage, the stone's height as a percentage of its average diameter (43 – 79)
  • Price: Price in US dollars ($326 – $18,823)
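
The depth variable can be made concrete with a short worked example. The ggplot2 documentation for diamonds defines it as z / mean(x, y) = 2 * z / (x + y), where x, y and z are the length, width and height in millimetres. The helper name depth_pct and the stone dimensions below are hypothetical, chosen only for illustration.

```r
# Total depth percentage as defined in the ggplot2 diamonds documentation:
# depth = z / mean(x, y) = 2 * z / (x + y), expressed as a percentage.
# depth_pct is a hypothetical helper, not part of any package.
depth_pct <- function(x, y, z) {
  100 * 2 * z / (x + y)
}

# Hypothetical stone: 4.00 mm long, 4.00 mm wide, 2.48 mm high.
depth_pct(x = 4.00, y = 4.00, z = 2.48)
```

A stone with these proportions comes out at about 62%, inside the band where most of the dataset sits.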

R Code for Data Preparation

Here is how the data is loaded and prepared for analysis:

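# ggplot2 supplies the diamonds data and static plots; plotly adds interactivity
library(ggplot2)
library(plotly)
library(dplyr)

data(diamonds)

# Fix the seed so the 3,000-row random sample is reproducible
set.seed(42)
diamonds_sample <- diamonds %>% sample_n(3000)

# Make the cut ordering explicit, from lowest to highest quality
diamonds_sample$cut <- factor(diamonds_sample$cut,
  levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))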

3D Plotly: Carat, Depth & Price

3D Plot Analysis

Key Observations:

  • Carat Drives Price: Diamonds above 2 carats cluster in the upper price tier, consistent with carat weight being the primary price driver.

  • Depth is Neutral: Depth percentage is tightly bunched between 57% and 65% regardless of price, indicating depth has minimal standalone influence.

  • Cut Quality Spread: Ideal and Premium cuts appear across the full price range, showing that cut interacts with carat rather than acting alone.

  • Combined Insight: The 3D view makes clear that price escalation is most strongly tied to carat, with cut quality providing secondary stratification.
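
The figure these observations describe is not reproduced in this text. A minimal sketch of how a comparable 3D scatter could be built with plotly follows. It assumes the same seed and sample size as the data preparation step; the object name fig_3d and the marker size are illustrative choices, not the original settings.

```r
library(ggplot2)  # provides the diamonds data
library(plotly)
library(dplyr)

set.seed(42)
diamonds_sample <- diamonds %>% sample_n(3000)

# One marker per diamond: carat and depth on the floor axes, price on the vertical axis
fig_3d <- plot_ly(
  data = diamonds_sample,
  x = ~carat, y = ~depth, z = ~price,
  color = ~cut,
  type = "scatter3d", mode = "markers",
  marker = list(size = 2)
)

# Rough numeric checks on the observations above
# Share of sampled stones whose depth falls in the 57-65% band
depth_band_share <- mean(diamonds_sample$depth >= 57 & diamonds_sample$depth <= 65)
# Price spread among sampled stones heavier than 2 carats
price_above_2ct <- summary(diamonds_sample$price[diamonds_sample$carat > 2])
```

Evaluating fig_3d in an interactive session renders the plot; depth_band_share and price_above_2ct give a quick way to compare the visual impressions against the numbers.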

Plotly Scatter: Carat vs Price by Cut

ggplot Boxplot: Price by Cut Quality

ggplot Bar Chart: Count by Color Grade and Cut
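
The three figures titled above are not reproduced in this text. The sketch below shows one way comparable plots could be drawn, assuming the same seed and sample size as the data preparation step. The object names, axis labels and the dodged bar layout are assumptions rather than the original settings.

```r
library(ggplot2)
library(plotly)
library(dplyr)

set.seed(42)
diamonds_sample <- diamonds %>% sample_n(3000)

# Interactive scatter: carat vs price, coloured by cut
fig_scatter <- plot_ly(
  data = diamonds_sample,
  x = ~carat, y = ~price, color = ~cut,
  type = "scatter", mode = "markers"
)

# Boxplot: price distribution within each cut grade
fig_box <- ggplot(diamonds_sample, aes(x = cut, y = price, fill = cut)) +
  geom_boxplot() +
  labs(x = "Cut", y = "Price (USD)")

# Bar chart: number of diamonds per colour grade, split by cut
fig_bar <- ggplot(diamonds_sample, aes(x = color, fill = cut)) +
  geom_bar(position = "dodge") +
  labs(x = "Color grade", y = "Count")
```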

Statistical Analysis: Descriptive Statistics

diamonds_sample %>%
  group_by(cut) %>%
  summarise(
    Count        = n(),
    Mean_Price   = round(mean(price), 0),
    Median_Price = round(median(price), 0),
    Mean_Carat   = round(mean(carat), 2),
    SD_Price     = round(sd(price), 0)
  )
## # A tibble: 5 × 6
##   cut       Count Mean_Price Median_Price Mean_Carat SD_Price
##   <ord>     <int>      <dbl>        <dbl>      <dbl>    <dbl>
## 1 Fair         96       4539         3204       1.11     3365
## 2 Good        286       3988         3006       0.85     3710
## 3 Very Good   695       3911         2759       0.81     3798
## 4 Premium     732       4685         3320       0.9      4502
## 5 Ideal      1191       3303         1790       0.69     3649

Summary Statistics: Interpretation

Key Findings:

  • Fair vs Ideal Paradox: Fair-cut diamonds have a higher mean price than Ideal cuts ($4,539 vs $3,303) because the Fair stones in the sample are much heavier on average (1.11 vs 0.69 carats); carat weight outweighs the cut-quality penalty.

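  • Ideal Cut Spread: Ideal cuts show the largest standard deviation relative to their mean (3,649 / 3,303 ≈ 1.10), while Fair cuts show the smallest (3,365 / 4,539 ≈ 0.74). The Ideal median ($1,790) sits far below its mean, pointing to a right-skewed price distribution rather than unusually consistent pricing.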

  • Good vs Very Good Gap: Mean price barely moves between Good ($3,988) and Very Good ($3,911). The larger shifts are from Very Good to Premium (+$774) and from Premium to Ideal (-$1,382), and they track mean carat (0.81 → 0.90 → 0.69) rather than cut alone.

  • Market Share: Ideal cut has the highest count in the sample (1,191 of 3,000, about 40%), which suggests Ideal is the most commonly listed cut grade.
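
The relative spread and the step-to-step changes in mean price can be checked directly from the printed table. The sketch below copies the printed values by hand instead of recomputing them from the data; the column names cv, step_change and price_per_carat are illustrative.

```r
# Values copied from the printed summary table (not recomputed from the data)
cut_summary <- data.frame(
  cut        = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
  mean_price = c(4539, 3988, 3911, 4685, 3303),
  sd_price   = c(3365, 3710, 3798, 4502, 3649),
  mean_carat = c(1.11, 0.85, 0.81, 0.90, 0.69)
)

# Coefficient of variation: standard deviation relative to the mean
cut_summary$cv <- round(cut_summary$sd_price / cut_summary$mean_price, 2)

# Change in mean price when moving up one cut grade (NA for the first grade)
cut_summary$step_change <- c(NA, diff(cut_summary$mean_price))

# Mean price divided by mean carat: a crude size-adjusted comparison
# (a ratio of means, not the mean of per-stone price per carat)
cut_summary$price_per_carat <- round(cut_summary$mean_price / cut_summary$mean_carat)

cut_summary
```

Dividing by mean carat is only a rough adjustment, but it shows how much of the gap between cut grades shrinks once stone size is taken into account.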

Statistical Analysis: Linear Regression

lm_model <- lm(price ~ carat + depth + table, data = diamonds_sample)
summary(lm_model)
## 
## Call:
## lm(formula = price ~ carat + depth + table, data = diamonds_sample)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11296.0   -803.6    -25.2    531.0  11147.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14529.94    1659.70   8.755  < 2e-16 ***
## carat        7755.25      59.94 129.392  < 2e-16 ***
## depth        -179.84      20.38  -8.822  < 2e-16 ***
## table         -99.62      13.40  -7.432 1.39e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1516 on 2996 degrees of freedom
## Multiple R-squared:  0.8521, Adjusted R-squared:  0.8519 
## F-statistic:  5753 on 3 and 2996 DF,  p-value: < 2.2e-16

Regression Analysis: Interpretation

Comprehensive Findings:

  • Carat Coefficient: Each additional carat adds approximately $7,755 to price (holding depth and table constant), confirming carat as the dominant predictor by a wide margin.

  • Model Fit (R²): The model explains roughly 85% of price variation, demonstrating that carat weight, depth, and table together are strong structural predictors of diamond price.

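  • Depth & Table: Both carry statistically significant negative coefficients that are small next to carat: roughly -$180 per percentage point of depth and -$100 per percentage point of table. One plausible reading is that very deep stones or wide tables reduce brilliance, though the model itself does not test that mechanism.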

  • Practical Takeaway: When budgeting for a diamond, carat weight should drive the primary decision; cut proportions (depth, table) provide secondary value optimization once carat is fixed.
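
To see what the fitted coefficients imply, the sketch below plugs a few stones into the regression equation by hand. The coefficients are copied from the printed output; the helper predict_price and the example stones (depth 61.7%, table 57%) are hypothetical values chosen for illustration.

```r
# Coefficients copied from the printed summary(lm_model) output
coefs <- c(intercept = 14529.94, carat = 7755.25, depth = -179.84, table = -99.62)

# Hypothetical helper: evaluates the fitted linear equation for one stone
predict_price <- function(carat, depth, table) {
  unname(coefs["intercept"] + coefs["carat"] * carat +
         coefs["depth"] * depth + coefs["table"] * table)
}

# Predicted price for a hypothetical 1-carat stone
price_1ct <- predict_price(carat = 1.0, depth = 61.7, table = 57)

# Moving from 1 to 2 carats with the same proportions adds exactly the carat coefficient
price_2ct <- predict_price(carat = 2.0, depth = 61.7, table = 57)
carat_step <- price_2ct - price_1ct

# A straight line in carat can fall below zero for small stones
price_small <- predict_price(carat = 0.2, depth = 61.7, table = 57)
```

The negative prediction for the 0.2-carat stone suggests that a purely linear model in carat fits poorly at the low end, even though the overall R² is high.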

Key Insights and Conclusions

Major Findings:

First: Carat weight is the overwhelmingly dominant price driver — the 3D scatter, 2D scatter plot, and linear regression all converge on this conclusion.

Second: Cut quality creates a price paradox in which Fair-cut diamonds command higher average prices than Ideal cuts, largely because of their larger average carat size. In raw price terms, carat trumps cut.

Third: Color grade distributes relatively evenly across cut qualities in the bar chart, suggesting that color and cut selection are largely independent production decisions.

Study Limitations: The sample of 3,000 records preserves rendering performance but reduces statistical power for rarer combinations such as large Fair-cut diamonds. Future work should incorporate clarity grade and a full GLM to isolate each attribute’s contribution.
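
One possible shape for the extended model mentioned above is sketched here. The Gamma family with a log link and the log(carat) term are assumptions chosen because price is positive and right-skewed; they are not results from this analysis. The object names model_data and glm_model are illustrative.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

set.seed(42)
diamonds_sample <- diamonds %>% sample_n(3000)

# Treat the grades as unordered factors so each level gets its own
# coefficient relative to a baseline (instead of polynomial contrasts)
model_data <- diamonds_sample %>%
  mutate(across(c(cut, color, clarity), ~ factor(.x, ordered = FALSE)))

# Gamma GLM with a log link: effects act multiplicatively on price,
# and log(carat) lets price scale as a power of weight
glm_model <- glm(
  price ~ log(carat) + cut + color + clarity,
  family = Gamma(link = "log"),
  data = model_data
)

summary(glm_model)
```

With a log link, each exponentiated coefficient reads as a multiplicative change in expected price relative to the baseline grade, which makes the cut, color and clarity effects directly comparable.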

Thank You

Dataset Source: ggplot2 built-in diamonds dataset (originally from AwesomeGems.com)

Tools Used: R, ggplot2, plotly, dplyr

References:

  • Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
  • R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.