The Question

Every engagement ring shopper hears about the Four Cs:

  • Carat — weight
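  • Cut — quality of the cut, graded from Fair to Ideal (not the stone's shape)
  • Color — D (colorless, best) to J (near colorless, the lowest grade in this dataset)
  • Clarity — freedom from inclusions and blemishes, graded from I1 (worst) to IF (best)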

Jewelers say all four drive the price, but the biggest one is carat. We use the diamonds dataset from ggplot2 to see how well carat alone predicts price.

Meet the Data

## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

The dataset has 53,940 diamonds, priced from $326 to $18,823 and weighing 0.2 to 5.01 carats.
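The preview and the ranges quoted above can be reproduced with a few lines of R. This is a minimal sketch that assumes the ggplot2 package is installed; the variable names `n_rows`, `price_range`, and `carat_range` are illustrative only.

```r
library(ggplot2)  # assumption: ggplot2 is installed; it ships the diamonds data

# First six rows, matching the preview above
head(diamonds)

# Number of diamonds and the observed ranges of price and carat
n_rows <- nrow(diamonds)
price_range <- range(diamonds$price)
carat_range <- range(diamonds$carat)

n_rows
price_range
carat_range
```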

How Much Do Diamonds Cost?

Most diamonds cost under $5,000, but a long tail reaches past $18,000 — a classic right-skewed distribution.
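A histogram is one way to see this skew. The sketch below is an assumption about how such a plot could be drawn, not necessarily the code behind the original slide; the $500 bin width and the names `p_hist` and `share_under_5000` are illustrative choices.

```r
library(ggplot2)

# Histogram of price; the bin width of 500 dollars is an arbitrary choice
p_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500) +
  labs(x = "Price (US dollars)", y = "Number of diamonds")
print(p_hist)

# Numeric check of the skew: in a right-skewed distribution the mean
# sits above the median
summary(diamonds$price)

# Share of diamonds priced under $5,000
share_under_5000 <- mean(diamonds$price < 5000)
share_under_5000
```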

Price vs Carat

Clear upward trend: bigger stones cost more. The blue line is the best-fit line we compute next.
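A scatter plot with a fitted line can be sketched as follows. This is one plausible way to draw it with ggplot2, assuming the package is installed; the transparency level and the name `p_scatter` are illustrative choices, and `geom_smooth()` draws its line in blue by default.

```r
library(ggplot2)

p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  # Transparency reduces overplotting with roughly 54,000 points
  geom_point(alpha = 0.1) +
  # Least-squares line of price on carat, without a confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Carat", y = "Price (US dollars)")
print(p_scatter)
```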

The Linear Model

We predict price from carat with simple linear regression: \[ \text{Price} = \beta_0 + \beta_1 \cdot \text{Carat} + \varepsilon \]

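The best \(\beta_0\) and \(\beta_1\) minimize the sum of squared errors between each observed price \(y_i\) and its predicted price \(\hat{y}_i = \beta_0 + \beta_1 x_i\), where \(x_i\) is the carat weight of diamond \(i\): \[ \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2} \]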

  • \(\beta_0\) — intercept (where the line crosses the price axis)
  • \(\beta_1\) — slope (extra dollars per extra carat)
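For a single predictor, the minimizing values have a standard closed form: the slope is the covariance of carat and price divided by the variance of carat, and the intercept makes the line pass through the point of means. The slides do not show this formula, so the sketch below is an illustration; the names `b0`, `b1`, `sse`, and `sse_nudged` are hypothetical.

```r
library(ggplot2)

x <- diamonds$carat
y <- diamonds$price

# Closed-form least-squares estimates for one predictor
b1 <- cov(x, y) / var(x)        # slope: dollars per carat
b0 <- mean(y) - b1 * mean(x)    # intercept: line passes through (mean x, mean y)

# Sum of squared errors at the fitted line
sse <- sum((y - (b0 + b1 * x))^2)

# Moving the slope away from b1 increases the SSE
sse_nudged <- sum((y - (b0 + (b1 + 100) * x))^2)

c(intercept = b0, slope = b1)
sse < sse_nudged
```

These values should agree with the coefficients that `lm()` reports for the same data.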

Fitting the Model in R

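library(ggplot2)  # provides the diamonds dataset
# Ordinary least-squares fit of price (US dollars) on carat
fit <- lm(price ~ carat, data = diamonds)
summary(fit)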
## 
## Call:
## lm(formula = price ~ carat, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18585.3   -804.8    -18.9    537.4  12731.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
## carat        7756.43      14.07   551.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

The slope is about $7,756 per carat, and R-squared ≈ 0.85.
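To see what these coefficients imply for individual stones, the fitted model can be used for prediction. The weights below are hypothetical values chosen for illustration, and `new_stones` is an illustrative name.

```r
library(ggplot2)

fit <- lm(price ~ carat, data = diamonds)

# Hypothetical stone weights, in carats
new_stones <- data.frame(carat = c(0.25, 1, 2))

# Predicted price = intercept + slope * carat
new_stones$predicted_price <- predict(fit, newdata = new_stones)
new_stones
```

With the rounded coefficients from the summary, a 1-carat stone is predicted at about -2256.36 + 7756.43 = $5,500. A 0.25-carat stone comes out at about -$317, which is a reminder that a straight line is only a rough description at the small end of the weight range.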

Interpreting the Fit

The slope \(\beta_1\) is the extra dollars per extra carat: \[ \hat{\beta}_1 \approx 7756 \text{ dollars per carat} \]

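The coefficient of determination measures the share of the variation in price that the model explains. It compares the sum of squared errors (SSE) with the total sum of squares around the mean price, \(\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^{2}\): \[ R^{2} = 1 - \frac{\text{SSE}}{\text{SST}} \approx 0.85 \]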

So carat alone explains about 85% of the variation in diamond prices.
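The R-squared value can be checked by computing SSE and SST directly. This is a minimal sketch assuming ggplot2 is installed; `sse`, `sst`, and `r2_manual` are illustrative names.

```r
library(ggplot2)

fit <- lm(price ~ carat, data = diamonds)

# Variation left unexplained by the model
sse <- sum(residuals(fit)^2)
# Total variation of price around its mean
sst <- sum((diamonds$price - mean(diamonds$price))^2)

r2_manual <- 1 - sse / sst
r2_manual

# The same quantity as reported by summary()
summary(fit)$r.squared

# With a single predictor, R-squared equals the squared correlation
r2_from_cor <- cor(diamonds$carat, diamonds$price)^2
r2_from_cor
```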

Takeaways

  • Carat is the biggest driver of diamond price. A one-carat increase adds roughly $7,756 to the predicted price.
  • The simple linear model explains about 85% of the variation in price — pretty good for a single predictor.
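  • Lesson for ring shoppers: weight alone accounts for about 85% of the price variation in this data. The model leaves out cut, color, and clarity, so it does not by itself show that carat outweighs each of them.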