1. Executive Summary

Report Overview

Cut, color, clarity, and carat: the “4 C’s” of diamonds are considered the main factors in a diamond’s price. Using a subset of diamonds listed by jewelry retailer Blue Nile, we explored the impact of the “4 C’s” on the price of diamonds, with a particular focus on carat. Carat, a measure of a diamond’s weight and the most common indicator of its size, is the characteristic most immediately recognizable to a consumer, which makes it a particularly interesting attribute to analyze alongside price. Cut, which assesses a diamond’s light performance (its brilliance, or how luminous it is), is said by Blue Nile to be the most important factor in determining a diamond’s price. We also considered color, which measures how colorless the diamond is (the less color, the better), and clarity, which measures small imperfections (called inclusions) in the diamond.

Our subset includes 1214 of the 2300 diamonds listed by Blue Nile. The diamonds in our data have a median price of $1463.50, and many more of them weigh less than one carat than weigh more. About 60% of the diamonds in our data are Ideal cut, and each color appears in roughly even proportions. Blue Nile only carries diamonds of relatively high clarity, so even its lowest-clarity diamonds are solidly high-mid grade in the big picture. Within that range, our data contains a much greater proportion of diamonds at the lower end of the clarity grades that Blue Nile sells. The median prices of the sub-classifications within color, clarity, and cut are quite close.

Key Findings

In line with conventional wisdom, Blue Nile claims that the “4 C’s” all impact the price of a diamond: the better the grade or measurement in each category, the higher the price. We found that this generally holds true for each of the “4 C’s”. Although Blue Nile says that cut is the biggest factor in price, we were not able to confirm or refute this claim. When evaluating cut alongside carat, we found that the greater the carat, the less impact cut generally had on the price of the diamond. When evaluating color alongside carat, we found the opposite effect: the larger the diamond, the more impact color generally had on its price. This makes sense logically, since the bigger the diamond, the easier it is to notice the color. For clarity alongside carat, we found that clarity had a fairly consistent impact on price regardless of size. Finally, we looked more closely at the relationship between price and carat. Our analysis found that, on average, every 1% increase in carat is associated with an approximately 1.94% increase in price.

2. Data Overview

Data set and Variable Descriptions

Our data includes 1214 diamonds (observations) with 5 columns/variables: carat, clarity, color, cut and price. This data set is a subset of the overall diamond data set given by Blue Nile. If we’re thinking along the lines of linear regression, price would likely be our response and the rest of the variables would be our predictors.

Blue Nile describes the four C’s (4C’s) as:
1. Cut
2. Color
3. Clarity
4. Carat

Cut measures how well-proportioned a diamond’s dimensions are, and how its surfaces, or facets, are positioned to create sparkle and brilliance. A diamond’s beauty depends on its diameter in comparison to its depth. Blue Nile says that cut can actually be the biggest factor in a diamond’s price. The cut scale runs Poor, Fair, Good, Very Good, Ideal, and Astor; Astor is a special category created by Blue Nile.

A well-cut diamond has three characteristics:
Superior Brilliance - the reflection of white light.
Fire - the dispersion of light into the colors of the rainbow (colored light).
Scintillation - the play of contrast between dark and light areas (sparkle).

Because cut is considered the most influential characteristic of a diamond, it helps to understand the anatomy of a diamond:

Table: The largest facet of a gemstone

Crown: The top portion of a diamond extending from the girdle to the table

Girdle: The intersection of the crown and pavilion which defines the circumference of a diamond

Diameter: The measurement from one girdle edge of a diamond straight across to the opposing side

Pavilion: The bottom portion of a diamond, extending from the girdle to the culet

Culet: The facet at the tip of a gemstone. The preferred culet is not visible with the unaided eye (graded “none” or “small”)

Depth: The height of a gemstone measured from the culet to the table

Diamond cut and diamond shape are not the same thing. Although the terms are sometimes used interchangeably, they mean different things.

Diamond cut assesses light performance of a diamond and is based on a combination of factors: proportions, symmetry, and polish.

Diamond shape refers to the outline of a diamond. The round brilliant is Blue Nile’s most popular shape.

Color refers to how colorless the diamond is. This is measured on a scale from D (colorless) to Z (noticeably colored). Typically, the closer to colorless the diamond, the more expensive. Diamonds occur in a variety of colors—steel gray, white, blue, yellow, orange, red, green, pink to purple, brown, and black.

Diamond color grades at Blue Nile range from D (colorless) to K (faintly colored).

D-F color diamonds: colorless diamonds, the rarest and highest quality, with a pure icy look.

G-H and I-J color diamonds: near-colorless diamonds with no discernible color; great value for the quality.

K color diamonds: faint-color diamonds, less expensive than the grades above.

Diamond prices generally decline in alphabetical order of color grade. For example, a diamond with a G color grade is less expensive than a comparable diamond with a D color grade.

Clarity assesses small imperfections, called inclusions, within a diamond. This standard quantifies and specifies those inclusions. Clarity has 6 categories and 11 grades in total. From worst to best, they are: I (I3-I1), SI (SI2-SI1), VS (VS2-VS1), VVS (VVS2-VVS1), IF and FL. Typically, the higher the clarity grade, the more expensive the diamond. Blue Nile says that this criterion is the least important, so it may have the least effect on price. The categories are described in more detail below.

I1, I2, I3: Included (I) Diamonds. I clarity diamonds have obvious inclusions that are likely to be visible and impact beauty. Blue Nile does not sell I clarity grade loose diamonds for engagement ring designs, but it does offer a limited selection of jewelry preset with I1 diamonds.

SI1, SI2: Slightly Included (SI) Diamonds. Inclusions are noticeable at 10x magnification. If eye clean, SI diamonds are often the best value. SI2 inclusions may be detectable to a keen unaided eye, especially when viewed from the side.

VS1, VS2: Very Slightly Included (VS) Diamonds. Minor inclusions ranging from difficult (VS1) to somewhat easy (VS2) to see at 10x magnification. Great value; Blue Nile’s most popular diamond clarity.

VVS1, VVS2: Very, Very Slightly Included (VVS) Diamonds. VVS diamonds have minuscule inclusions that are difficult even for trained eyes to see under 10x magnification. VVS clarity is rare and results in an eye-clean appearance.

IF: Internally Flawless (IF) Diamonds. IF diamonds have no inclusions within the stone; only surface characteristics set the grade. Some small surface blemishes may be visible under a microscope. Visually eye clean.

FL: Flawless (FL) Diamonds. No internal or external characteristics. Less than 1% of all diamonds are FL clarity. A flawless diamond is incredibly rare because it’s nearly impossible to find a diamond 100% free of inclusions.

The Five Diamond Clarity Factors
1. Size – The size of the inclusions in a diamond is one of the most important factors in determining its clarity grade. This is because the bigger the inclusions, the larger the impact they’ll have on a diamond’s appearance.

2. Nature – Nature refers to the type of inclusions that can be seen in the diamond, as well as the depth of these inclusions within the diamond. This aspect also covers other characteristics of inclusions that can be seen inside the diamond.

3. Number – Grading entities also take into account the number of inclusions within a diamond. If a diamond has a large number of inclusions, even if small, they can have a large impact on its appearance and clarity. The larger the number of inclusions, blemishes and other clarity characteristics, the greater the impact on a diamond’s beauty.

4. Location – The location of an inclusion refers to where on the diamond the inclusion is located. If the inclusion is situated in closer proximity to the center of the table, then the inclusion is more visible to the eye and the clarity grade will be impacted much more.

If the inclusion is close to the girdle, which is much further from the center of the table, then the inclusion may be more difficult to see. Inclusions found near the pavilion of the diamond can be reflected, because the facets act as mirrors.

5. Color and Relief – Color and relief refer to how noticeable the inclusions are in comparison with the diamond. Put simply, they describe how much contrast there is between the diamond and the inclusions. The higher the relief, the darker the color may seem, which can affect diamond grading.

Carat measures a diamond’s weight. Typically, the heavier the diamond (the larger its carat weight), the more expensive it is. We need to distinguish between carat and size: carat refers simply to how much a diamond weighs, whereas size is about its dimensions. Diamonds with higher carat weights are cut from larger rough crystals, which are harder to source than small crystals. So, the relationship between carat weight and price depends on the rarity or availability of a rough crystal. Blue Nile claims that carat has the biggest effect on price. Prices increase exponentially as carat weight goes up.

Another factor that is not part of the 4C’s, but that influences price, is the shape of the diamond.

Shape refers to the geometric outline and overall physical form of a diamond. Every diamond shape has its own attributes and cut specifications, which also play a large role in the overall look of the stone.

While diamonds can be cut in any shape, there are 10 popular diamond shapes:

Round, Princess, Cushion, Oval, Emerald, Pear, Marquise, Asscher, Radiant, and Heart.

Carat

Carat measures a diamond’s weight. An objective measurement, carat weight is the most popular indicator for showing how large a diamond is.

In the data set there are no NA values associated with the carat variable.


Our distribution of diamonds by carat is right-skewed. When we group by cut (chosen because it is said to be the biggest factor in a diamond’s price), we can see that the distributions for each cut are quite different. Because of this, it’s relatively unhelpful to compare summary statistics of carat across cuts. Notably, Very Good diamonds have the highest density above 1 carat in their distribution.

It seems carat has an exponential relationship with price, with bigger diamonds being more expensive. We’ll log transform both carat and price and create the appropriate scatter plot.


It seems carat does impact price, with larger diamonds being more expensive. We’ll explore this more in our linear model section later.

Cut

Cut measures how well-proportioned a diamond’s dimensions are including the balance and brilliance of its facets. Diamond cut is considered the most important of the four Cs.

In our data set, cut is stored as a character variable with four sub-groups that reflect the quality of the cut: Astor Ideal, Good, Ideal, and Very Good. ‘Astor Ideal’ is the least represented cut, with only 20 diamonds. ‘Ideal’ is the most represented cut, with 739 diamonds. ‘Very Good’ has 382 observations and ‘Good’ has 73.
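As a quick arithmetic check, the counts quoted above can be turned into shares of the data set. This is a minimal sketch that only reuses the four counts from the text; the object names `cut_counts` and `cut_share` are ours and do not come from the original analysis.

```r
# Counts by cut, copied from the description above.
cut_counts <- c("Astor Ideal" = 20, "Good" = 73, "Ideal" = 739, "Very Good" = 382)

# The counts should add up to the 1214 observations in the data set.
total <- sum(cut_counts)

# Share of each cut, as a percentage rounded to one decimal place.
cut_share <- round(100 * cut_counts / total, 1)

print(total)
print(cut_share)
```

The Ideal share printed here is the basis for the "about 60%" figure used elsewhere in the report.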

Although Astor Ideal cut diamonds have a greater IQR and median price than the other cuts, there are only 20 of them in our data set, so this doesn’t tell us much. Surprisingly, Ideal cut diamonds have a lower median price than Very Good or Good diamonds, while Very Good and Good diamonds have very similar IQRs and close median values. This indicates that the other criteria have a sizable impact on price. However, Ideal cut diamonds make up about 60% of our data set, which could explain why they have a wider IQR.


Since our scatter plot of price against carat follows an exponential curve (as we’ll see in the linear model section), we log transformed both variables before plotting. Fitting regression lines grouped by cut, we can see that the line for Ideal cut diamonds has a slightly steeper slope than the lines for Very Good and Good cuts. Because the lines converge as carat increases, cut type appears to have a large impact on the price of smaller diamonds and a smaller one as diamonds get bigger. Large diamonds could be expensive simply because they’re large, while the price of small diamonds depends more heavily on cut. Overall, cut type does appear to affect price, and better cuts are generally more expensive.
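One way to put numbers on "different slopes by group" is a model with an interaction between log(carat) and cut. The sketch below is illustrative only: it runs on simulated data, because the Blue Nile subset is not reproduced here, and the data frame `sim`, the coefficients used to generate it, and the object `fit_by_cut` are all hypothetical.

```r
set.seed(1)
n <- 300

# Simulated stand-in for the diamonds data (hypothetical values).
sim <- data.frame(
  carat = runif(n, 0.3, 3),
  cut   = factor(sample(c("Good", "Very Good", "Ideal"), n, replace = TRUE),
                 levels = c("Good", "Very Good", "Ideal"))
)

# Assumed data-generating process: a common slope on log(carat)
# plus a small cut-specific shift and multiplicative noise.
sim$price <- exp(8.5 + 1.9 * log(sim$carat) +
                 0.1 * (as.integer(sim$cut) - 1) +
                 rnorm(n, sd = 0.25))

# The interaction terms estimate how each cut's slope differs
# from the slope of the reference level ("Good").
fit_by_cut <- lm(log(price) ~ log(carat) * cut, data = sim)
print(coef(fit_by_cut))

# Compare against a parallel-slopes model to see whether the
# slope differences are larger than noise would produce.
fit_parallel <- lm(log(price) ~ log(carat) + cut, data = sim)
print(anova(fit_parallel, fit_by_cut))
```

The same formula with `color` or `clarity` in place of `cut` would give the corresponding comparison for the other grouped plots.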

Color

Diamond color refers to how colorless a diamond is. Color is the second most important of the 4Cs of diamonds. The less color, the higher the grade.

  • Variable Description
  • Data Type characters
  • Uni-variate proportions visualization
  • Color - Price Bi-variate Relationship Visualization

We have a fairly even proportion of each diamond color in our data set. Interestingly, although D is the best color, it has the second highest proportion of diamonds in our data set, implying that it isn’t rare. Although the IQRs of the colors differ slightly, the medians are very similar. This suggests that color, considered on its own, might not have a significant impact on the price of the diamond.


Interestingly, when fitting regression lines by color, we can see that the slopes of the lines match the color quality: D has the steepest slope while J has the shallowest. Since the regression lines diverge, this suggests that, on average, the larger the diamond, the greater the impact of color on its price. It’s worth noting that our proportion table shows that the better colors aren’t necessarily rarer. Perhaps this lack of scarcity curbs the impact of color on the price of smaller diamonds, while larger diamonds could be expensive more because of their size than their color. Still, color quality does affect the price of the diamond, and better colored diamonds are generally more expensive.

Clarity

Clarity assesses small imperfections within a diamond. Inclusions can occur naturally during the diamond forming process. Clarity is used to quantify and specify any inclusions.

There are very few flawless diamonds in our data set, while the large majority of our diamonds are VS1 clarity or worse. This implies that high clarity might be rarer than high grades in other criteria (for example, color).

The median price of flawless diamonds is far above that of the other clarity categories, but we have so few flawless diamonds in our data set that this isn’t a particularly useful insight. The median prices of the remaining clarity categories are very close, and their IQRs are similar.


Interestingly, the slopes of the regression lines in our scatter plot by clarity are quite similar. This suggests that regardless of diamond size, clarity has a consistent impact on the price of the diamond. Clarity also seems to affect the price of a diamond, with better clarity diamonds being generally more expensive.

Price

Price is right-skewed, showing we have many more lower priced diamonds in our data set than expensive ones.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    322.0    723.5   1463.5   7056.7   4640.8 355403.0

3. Simple Linear Regression Model

## 
## Call:
## lm(formula = price ~ carat, data = diamonds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49375  -5048   1867   4965 236711 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13550.9      559.7  -24.21   <2e-16 ***
## carat        25333.9      494.4   51.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13560 on 1212 degrees of freedom
## Multiple R-squared:  0.6842, Adjusted R-squared:  0.6839 
## F-statistic:  2625 on 1 and 1212 DF,  p-value: < 2.2e-16

The first model we fit predicts diamond price from carat. Both the intercept and the carat coefficient are highly significant (p < 2e-16), and the model estimates an increase of about $25,334 in predicted price per additional carat.

Although the coefficients in the initial model are highly significant, the model does not meet the assumptions of simple linear regression. As shown in the Initial Model Residual Plot, the residuals form an open funnel: their variance grows as the predicted values grow. Additionally, many of the predicted values on the lower end tend to have negative residuals while more expensive diamonds tend to have positive residuals. This plot suggests that we should look to transform one (or both) of our variables.
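The funnel pattern described above can be reproduced with a residuals-versus-fitted plot and a rough comparison of residual spread. This is a sketch on simulated data, since the original data set is not included here; `sim`, `fit_raw`, and the generating coefficients are hypothetical stand-ins, not the authors' objects.

```r
set.seed(2)
n <- 500

# Simulated stand-in: price follows a power law in carat with
# multiplicative noise (assumed, for illustration only).
sim <- data.frame(carat = runif(n, 0.3, 3))
sim$price <- exp(8.5 + 1.9 * log(sim$carat) + rnorm(n, sd = 0.25))

# Untransformed simple linear regression, as in the first model.
fit_raw <- lm(price ~ carat, data = sim)

# Residuals against fitted values; a funnel shape signals
# non-constant variance.
plot(fitted(fit_raw), resid(fit_raw),
     xlab = "Fitted price", ylab = "Residual")
abline(h = 0, lty = 2)

# Crude numeric check: residual spread in the lower and upper
# halves of the fitted values.
upper   <- fitted(fit_raw) > median(fitted(fit_raw))
sd_low  <- sd(resid(fit_raw)[!upper])
sd_high <- sd(resid(fit_raw)[upper])
print(c(sd_low = sd_low, sd_high = sd_high))
```

A much larger `sd_high` than `sd_low` is the numeric counterpart of the open funnel in the plot.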

The Box-Cox plot suggests raising our response variable (price) to a power of about 0.3. Instead of raising price to the power of 0.3, we decided to apply a log transformation (the Box-Cox transformation at lambda = 0), given that the suggested lambda was so close to 0.
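For readers who want to see how such a lambda is obtained, the sketch below runs a Box-Cox profile on simulated data. It assumes the `MASS` package, which ships with standard R installations, and uses its `boxcox()` function; the data frame `sim` and its generating process are hypothetical and chosen so that the true lambda is 0, which is not the same situation as the report's data.

```r
library(MASS)  # provides boxcox()

set.seed(3)
n <- 500

# Simulated stand-in in which log(price) is linear in carat,
# so the ideal Box-Cox lambda is 0 (assumed for illustration).
sim <- data.frame(carat = runif(n, 0.3, 3))
sim$price <- exp(6.4 + 1.5 * sim$carat + rnorm(n, sd = 0.3))

fit_raw <- lm(price ~ carat, data = sim)

# Profile log-likelihood over a grid of lambda values; plotit = FALSE
# returns the grid (x) and log-likelihood (y) instead of drawing.
bc <- boxcox(fit_raw, lambda = seq(-1, 1, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)

# A lambda near 0 points to the log transformation.
sim$price_trans <- log(sim$price)
fit_log <- lm(price_trans ~ carat, data = sim)
print(summary(fit_log)$r.squared)
```

Setting `plotit = TRUE` instead draws the familiar curve with its confidence interval for lambda.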

## 
## Call:
## lm(formula = price_trans ~ carat, data = diamonds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0128 -0.4350  0.0067  0.4139  1.8178 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.43235    0.02490  258.37   <2e-16 ***
## carat        1.45739    0.02199   66.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6032 on 1212 degrees of freedom
## Multiple R-squared:  0.7837, Adjusted R-squared:  0.7835 
## F-statistic:  4392 on 1 and 1212 DF,  p-value: < 2.2e-16

After transforming the price variable, we fit our second model. This model too has strong p-values for the intercept and carat coefficients. As before though, we need to check simple linear regression assumptions.

Our new residual plot shows much improved constant variance but also shows negative residuals on the high and low end of predicted values while we see positive residuals in the middle of the predicted values. This suggests that we still want to transform our predictor variable (carat). We will do so in our final model. We ended up performing a log transformation on the carat predictor before fitting it to our model.

## 
## Call:
## lm(formula = price_trans ~ carat_trans, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96394 -0.17231 -0.00252  0.14742  1.14095 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.521208   0.009734   875.4   <2e-16 ***
## carat_trans 1.944020   0.012166   159.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2761 on 1212 degrees of freedom
## Multiple R-squared:  0.9547, Adjusted R-squared:  0.9546 
## F-statistic: 2.553e+04 on 1 and 1212 DF,  p-value: < 2.2e-16


Our final model residual plot shows residuals that are evenly scattered above and below zero. There is a slight wavelike pattern, but overall the residuals have constant variance across the predicted values. Additionally, the scatter plot of log(price) against log(carat) shows a very strong linear relationship, which suggests (along with everything else) that we have a strong simple linear regression. Our final model is log(price) = Beta_0 + Beta_1 * log(carat) + error. It performs best for diamonds that are neither very cheap nor very expensive, though it still performs well at those extremes. The fitted regression equation is yhat = 8.521208 + 1.944020x, where yhat is the predicted log(price) and x is log(carat). We can interpret this as follows: for every 1% increase in carat, there is an approximately 1.94% increase in price. (Source for interpretation: https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/)
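The percentage interpretation can be checked directly from the two fitted coefficients. The sketch below assumes the transformation used the natural log (the default for R's `log()`); the helper names `price_ratio` and `predict_price` are ours, introduced only for illustration.

```r
b0 <- 8.521208   # intercept from the final model output
b1 <- 1.944020   # coefficient on log(carat) from the final model output

# In a log-log model, scaling carat by a factor k scales the
# predicted price by k^b1.
price_ratio <- function(k) k^b1

pct_1  <- 100 * (price_ratio(1.01) - 1)  # effect of 1% more carat
pct_10 <- 100 * (price_ratio(1.10) - 1)  # effect of 10% more carat
print(c(pct_1 = pct_1, pct_10 = pct_10))

# Back-transformed prediction in dollars for a given carat weight.
# This is exp() of the predicted log(price), with no correction
# for retransformation bias.
predict_price <- function(carat) exp(b0 + b1 * log(carat))
print(predict_price(c(0.5, 1, 2)))
```

The exact figure for a 1% increase comes out slightly above the coefficient itself, and the gap widens for larger changes, which is why the "b1 percent" reading is best kept to small percentage changes in carat.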