Welcome

Notes: In this section, Facebook data scientist Saung Masin shares some of his work in EDA. This lesson will cover an analysis of diamonds. Some of the things that will be covered include the rich history of the diamond market and how to use the EDA techniques covered in this course to develop a quantitative understanding of it.

The end goal is to build a predictive model of diamonds that is going to help us figure out whether a given diamond is a good deal or a rip-off.

This lesson will give a better understanding of what goes into the price of a diamond. The socioeconomic and political history of the diamond industry is quite fascinating as well. Diamonds gave rise to the mining industry in South Africa, which is now the most advanced economy the region. Diamonds also drove the British and the Dutch to colonize southern Africa in the first place and have driven conflicts ranging from the Boer wars to modern day civil strife across the region. ***

Linear Regression Models

In this section, one of the instructors (Solomon) will use a linear regression to predict diamond price using other varaibles in the diamonds dataset.

suppressMessages(library("ggplot2"))
data(diamonds)

Scatterplot Review

Let’s start by examining two variables in the data set. The scatterplot is a powerful tool to help you understand the relationship between two continuous variables.

We can quickly see if the relationship is linear or not. In this case, we can use a variety of diamond characteristics to help us figure out whether the price advertised for any given diamond is reasonable or a rip-off.

Let’s consider the price of a diamond and it’s carat weight. Create a scatterplot of price (y) vs carat weight (x).

Limit the x-axis and y-axis to omit the top 1% of values.

ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point(fill=I("#F79420"), color=I("black"), shape=21) +
  scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99)) ) +
  scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99)) ) +
  ggtitle("Diamonds: Price vs. Carat")


Price and Carat Relationship

What do you notice about the relationship here between price and carat?

Response: There is an obvious positive (non-linear) relationship between both variables: as carat size increases, price also increases. It could be exponential. Also, the variance of the relationship increases as carat size increases. There is also very clear discrete values that carat size takes on, which are those vertical strips on the graph.

We can see that the linear trend line doesn’t go through the center of the data at some key places. It should curve in certain parts of the graph. It should slope up more towards the end. If we tried to use this for predictions, we might be off some key places inside and outside of the existing data that we have displayed.

ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point(fill=I("#F79420"), color=I("black"), shape=21) +
  stat_smooth(method="lm") +
  scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99)) ) +
  scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99)) ) +
  ggtitle("Diamonds: Price vs. Carat")


Frances Gerety

Notes: So far, we have only considered the price and carat weight. However, there is much more to this data set.

This diamonds data set comes with the ggplot2 package and contains the prices and the specs for more than 50,000 diamonds collected in 2008 from diamondsc.info. Analyzing this data set is useful because diamonds are unique in the way that isn’t true of most manufactured products that we are used to buying. You can’t just put in a model number and look up the price.


Saung Masin discusses the history of the market for diamonds: Hidden in the data are reflections of how a legendary marketing campaign permeated and was subsumed by our culture. Hints about how different social strata responded and how the diamond market functions today as a result. The story starts in 1870 when many tons of diamonds were discovered in South Africa near the Orange River. Until then, the diamond market had been small.

Only a few pounds of diamonds were mined each year from India and Brazil. At the time, there was no use for diamonds outside of jewelry, so price depended only on scarce supply. Hence, the project investors formed the De Beers Cartel in 1888 to try to control the global price of diamonds. By most accounts, this has been the most successful cartel in history. But World War I and the Great Depression saw diamond sales plummet.

In 1938 the De Beers Cartel contacted Philadelphia ad agency N.W. Ayer & Son to inquire whether, quote, “the use of propaganda in various forms might help jump start diamond sales in the U.S.” which looked like the only viable potential market for diamonds at the time.

Surveys showed, however, that among couples contemplating marriage, diamonds were low on the list of priorities: a luxury for the rich. Frances Gerety took on the the De Beers account at N.W. Ayer & Son and worked towards the company’s goal to quote “create a situationin which every couple contemplating marriage feels the need to acquire a diamond engagement ring.” A few years later she contined a famous slogan.

A diamonds is forever!

This slogan has appeared in every De Beers ad since 1948. In 1999, two weeks before Mrs. Gerety died, Advertising Age named it the slogan of the century!


The Rise of Diamonds

Notes: A ton of people argure that this campaign gave birth to modern demand advertising. The goal was not demand generation, but simply to impress the glamor, the dentiment and the emotional charge contained in the product itself. The company gave diamonds to movie stars and emphaized the size of diamonds that celebrities gave each other. They even persuaded the British royal family to wear diamonds over other gems.

Later, De Beers marketed diamonds to couples as a status symbol to reflect “a man’s success in life.” Gerety succeeded because getting engaged in America means getting a diamond.


ggpairs Function

Notes: There are multiple ways to create a scatterplot matrix:

Method #1: ggpairs plots each variable against the other in a pretty smart way.

In the lower triangle of the plot matrix, it uses grouped histograms for qualitative-qualitative pairs and scatterplots for quantitative-quantitative pairs.

In the upper triangle, it plots grouped histograms for qualitative-qualitative pairs, this time using the x instead of the y variable as the grouping factor; boxplots for qualitative-quantitative pairs, and it provides the correlation for quantitative-quantitative pairs.

Warning: The ggpairs() function can take a very long time to create the scatterplot matrix!! This is why it is important to sample your data as well and, if possible, exclude variables that are not of absolute interest. Having more than 10 variables will make each plot too small.

# install these if necessary
#install.packages('GGally') #used for ggpairs
#install.packages('scales') 
#install.packages('memisc') #used to summarize the regression
#install.packages('lattice')
#install.packages('MASS') 
#install.packages('car') #used to recode variables
#install.packages('reshape') #used to reshape and wrangle your data
#install.packages('plyr') #used to create interesting summaries and transformation for data

# load the ggplot graphics package and the others
suppressMessages(library(ggplot2))
suppressMessages(library(GGally))
suppressMessages(library(scales))
suppressMessages(library(memisc))

# sample 10,000 diamonds from the data set to get a snapshop of the large dataset
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, 
        lower = list(continuous = wrap("points", shape = I('.'))), 
        upper = list(combo = wrap("box", outlier.shape = I('.'))))


Method #2: We can also use the psych package and create a scatterplot matrix with the pairs.panels() function. However, it does not create a mixture of scatterplots, boxplots, and histograms based on whether the variable is quantitative or qualitative. This is where ggpairs has an advantage over this function. However, pairs.panels() does not take as much time to plot as the method from above, which is a plus.

#Alternative Scatterplot Matrix Function
suppressMessages(library(psych))
pairs.panels(diamond_samp,pch=".")


What are some things you notice in the ggpairs output? Response: Referring scatterplot matrix from method #1: First I looked at the diagonal plots: color, clarity, depth, and perhaps table appear to be normally distributed.

Next, I noticed that the x, y, and z variable appear to be highly correlated with each other, which makes sense since they are the dimensions of a diamond.

Also, x vs price, y vs. price, and z vs. price appear to be some sort of log graph: a price increases, each of these other variables also increase in a log-like fashion. Also. x vs. carat, y vs. carat, and z vs. carat also have the same log-like relationship.

Lastly, notice that the factor that determines the price of a diamond is the size, or carat weight, of the diamond. Recall, that carat vs price have an exponential relationship. Why might this be? Larger continuous chunks of diamonds without significant flaws are probably harder to find than smaller ones. This might help explain the sort of exponential looking curve.

Recall that weight is a function of volume and volume = xyz . This suggest that we might be interested in the cube root of carat weight. It is often the case that leveraging substantive knowledge about your data like this can lead to expecially fruitful transformations.


The Demand of Diamonds

Notes: Masin talks about the demand for diamonds. Refer to the following plot:

ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point(fill=I("#F79420"), color=I("black"), shape=21) +
  scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99)) ) +
  scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99)) ) +
  ggtitle("Diamonds: Price vs. Carat")

On the demand side, customers in the market for a less expensive, smaller diamond are probably more sensitive to price than more well-to-do buyers. Many less than one carat customers would surely never buy a dimond were it not for the social norm of presenting one when proposing.

Also, there are fewer customers who can afford a bigger diamond (one larger than one carat). Hence, we shouldn’t expect the market for bigger diamonds to be as competitive as the one for smaller diamonds. So it makes sense that the variance, as well as the price, would increase with carat size.

Now, often the distribution of any monetary variable like dollars will be highly skewed and vary over orders of magnitude. This can result from path dependence, for example the rich getting richer or multiplicative processes like year on year inflation or some combination of both. Hence, it is a good idea to look into compressing any such variable by putting it on a log scale.

Refer to this link for more information on when to tranform a variable: https://www.r-statistics.com/2013/05/log-transformations-for-skewed-and-wide-distributions-from-practical-data-science-with-r/.

suppressMessages(library(gridExtra))

plot1 <- ggplot(aes(price), data=diamonds) + 
  geom_histogram(binwidth=200) +
  ggtitle('Price')

plot2 <- ggplot(aes(price), data=diamonds) +
  geom_histogram() +
  ggtitle('Price (log10)') +
  scale_x_log10()

grid.arrange(plot1, plot2, ncol=2)


Connecting Demand and Price Distributions

Notes: When looking at these plots, what do you notice? Think specifically about the two peaks in the transformed plot and how it relates to the demand for diamonds.

Response: We can see that the price for diamonds are heavily skewed (long right tail). However, when you look at price on the log10 scale, it seems much better behaved. It is a lot closer to the bell curve normal distribution. It even shows evidence of bimodality on this scale. There appear to be two different “populations.” One peak might represent the mean for the people that buy diamonds of size one carat or less. And the other mean represents the rich people that can afford higher priced diamonds.


Scatterplot Transformation

Now that we have a better understanding of the demand for diamonds, lets replot the data. This time we will put price on a log10 scale and here is what it looks like. This plot looks better than before.

p1 <- ggplot(aes(carat, price), data=diamonds) +
  geom_point() +
  ggtitle("Price by Carat")
p2 <- ggplot(aes(carat, price), data=diamonds) +
  geom_point() +
  scale_y_continuous(trans=log10_trans()) +
  ggtitle("Price (log10) by Carat")
grid.arrange(p1, p2, ncol=2)

On the log scale, the prices look less dispersed at the high end of carat size and price, but actually, we can do better. Let’s try using the cube root of carat in light of our speculation about flaws being exponentially more likely in diamonds with more volume. Remember, volume is on a cubic scale!

Create a new function to transform the carat variable

First, we need a function to transform the carat variable. Refer to these series of helpful videos on creating functions in R: https://www.youtube.com/watch?v=Z1wB1rHAYzQ&list=PLOU2XLYxmsIK9qQfztXeybpHvru-TrqAP. This cuberoot_trans function takes the cube root of any input variable and it also has an inverse function to undo that operation, which we need to display the plot correctly.

cuberoot_trans = function() trans_new('cuberoot', 
                                      transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)

Use the cuberoot_trans function

Then, when we get to our actual ggplot command, what we’ll do is we’ll use the scale_x_continuous argument and transform the x-axis with this cube root transformation function. We are also transforming the y-axis with the log10 transformation.

ggplot(aes(carat, price), data = diamonds) + 
  geom_point() + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')

With these transformations that got our data on a nice scale, things look almost linear. We can now move forward and see how to model our data using just a linear model.


Overplotting Revisited

We still have to deal with overplotting, which occurs when multiple points take on the same value. Oftentimes, this occurs because of rounding. The following code allows us to see the carat and price values that occur the most frequently. These repeated values can really obscure the density and the sparsity of our data and really key points.

head(sort(table(diamonds$carat), decreasing=T))
## 
##  0.3 0.31 1.01  0.7 0.32    1 
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing=T))
## 
## 605 802 625 828 776 698 
## 132 127 126 125 124 121

To work around overplotting, the alpha, size, and and jitter option can be used for our plot.

ggplot(aes(carat, price), data = diamonds) + 
  geom_point(alpha=1/10, size=3/4, position = "jitter") + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')


Other Qualitative Factors

Notes: We can see what looks like an almost linear relationship between carat weight and price after the transformations. However, there are other factors that influence the price of a diamond. Clarity also factors into price. However, people commonly search for a diamond of a specific size, so we shouldn’t expect clarity to be as strong a factor as size (carat). Cut can also have a major impact on the fiery trait that diamonds have, so we would expect it to have some effect on price.


Price vs. Carat and Clarity

Alter the code below.

# Adjust the code below to color the points by clarity.

# A layer called scale_color_brewer() has 
# been added to adjust the legend and
# provide custom colors.

# You will need to install the package RColorBrewer
# in R to get the same colors and color palettes.

# install and load the RColorBrewer package
#install.packages('RColorBrewer')
suppressMessages(library(RColorBrewer))

ggplot(aes(x = carat, y = price, color=clarity), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type='div', 
                     guide=guide_legend(title='Clarity',
                                        reverse=T,
                                        override.aes=list(alpha=1, size=2))) +  
  scale_x_continuous(trans = cuberoot_trans(), 
                     limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), 
                     limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')

In order to color the points by clarity, we pass the argument color=clarity inside the aesthetic wrapper for ggplot.

Next, it is important to understand what the scale_color_brewer does: Type provides the graph with the particular types of color chosen. The guide argument allows us to set the desired conditions for the points and the text for the legend.


Clarity and Price

Response: Clarity does explain some of the change in price because it is clear that diamonds that are “IF” are the most expensive whereas “I1” are the least expensive clarity types. In other words, holding carat weight constant, we see that diamonds with lower clarity are almost always cheaper than diamonds with better clarity. Clarity explains a lot of the variance found in price!


Price vs. Carat and Cut

Alter the code below.

ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Cut', 
                                          reverse = T,
                                          override.aes = list(alpha = 1, 
                                                              size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Cut')


Cut and Price

Based on the plot, do you think cut accounts for some of the variance in price? Why?

Response: I believe that cut does NOT account for a huge chunk of the variance in price like clarity did. However, it does appear to account for a tiny percentage of the price variance. Keeping carat constant, we see some of the browns (fair and good) at the bottom, some of the white (very good) and light green (premium) at the middle, and the dark green (ideal) at the top. However, the division between these categories is not as clear as it was for clarity.

Solomon believes the answer to this question is debatable. He says that we don’t see much variation on cut. Most of the diamonds in the data are ideal cut anyway, so we lost the pattern that we saw before for clarity. ***

Price vs. Carat and Color

Alter the code below.

# Finally, let’s use diamond color to color our plot.

# Adjust the code below to color the points by diamond colors
# and change the titles.

ggplot(aes(x = carat, y = price, color = color), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = "Color", 
                                          reverse = F,
                                          override.aes = list(alpha = 1, 
                                                              size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Color')

Instructor Note: “Did you remove reverse = T on the guide_legend? It’s a small detail that reminds us that D is the best color and J is worse for rating diamond colors.”


Color and Price

Based on the the plot, do you think that the color of a diamond influences price? Why?

Response: Keeping carat size constant, we see very clear division on the various diamond “color” categories similar to what we saw for “clarity”. Therefore, diamond color does have a clear influence on price.

Solomon’s response: “Color does seem to explain some of the variance in price, just like we saw with the clarity variable. Blue Nile, a diamond retailer, however states that the difference between all color grades from D to J are basically not noticeable to the naked eye. Yet, we do see the color difference in the price tag.”

Read more on Blue Nile’s thoughts about diamond color at: http://www.bluenile.com/education/diamonds/color.


Linear Models in R

Notes: To create a linear model in R, we will use the lm function. Price is the outcome and carat is the predictor variable. We use our domain knowledge of diamonds and carat weight to take the cube root of carat weight (volume). Therefore we use the following function in R: lm(log(price) ~ carat^(1/3)).

Remember, we applied the log transformation to our long-tailed dollar variable and we speculated that the flawless diamond should become exponentially rarer as diamond volume increases. So we should be interested in the cube root of carat weight.


Building the Linear Model

Notes: We build the linear model for price. Notice the I() wrapper around each variable of the first model. The I stands for as is. In this case, it tells R to use the expression inside the I function to transform a variable before using it in the regression. This is instead of instructing R to interpret these symbols as part of the formula to construct the design matrix for the regression.

Information on the lm model can be found at: http://data.princeton.edu/R/linearModels.html.

We can also update the previous model to add the caret variable in the regression. The real functional relationship is surely not as simple as the cubed root of caret, so we add a simple linear function of caret in our model predicting price. We can continue to make more complex models by adding more variables. We add cut even though we don’t expect it to have much influence on price.

Next, we add color to a 4th model and clarity to a 5th model.

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)

m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5, sdigits=4)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamonds)
## 
## ==================================================================================
##                       m1           m2           m3           m4          m5       
## ----------------------------------------------------------------------------------
##   (Intercept)       2.821***     1.039***     0.874***    0.932***     0.415***   
##                    (0.006)      (0.019)      (0.019)     (0.017)      (0.010)     
##   I(carat^(1/3))    5.558***     8.568***     8.703***    8.438***     9.144***   
##                    (0.007)      (0.032)      (0.031)     (0.028)      (0.016)     
##   carat                         -1.137***    -1.163***   -0.992***    -1.093***   
##                                 (0.012)      (0.011)     (0.010)      (0.006)     
##   cut: .L                                     0.224***    0.224***     0.120***   
##                                              (0.004)     (0.004)      (0.002)     
##   cut: .Q                                    -0.062***   -0.062***    -0.031***   
##                                              (0.004)     (0.003)      (0.002)     
##   cut: .C                                     0.051***    0.052***     0.014***   
##                                              (0.003)     (0.003)      (0.002)     
##   cut: ^4                                     0.018***    0.018***    -0.002      
##                                              (0.003)     (0.002)      (0.001)     
##   color: .L                                              -0.373***    -0.441***   
##                                                          (0.003)      (0.002)     
##   color: .Q                                              -0.129***    -0.093***   
##                                                          (0.003)      (0.002)     
##   color: .C                                               0.001       -0.013***   
##                                                          (0.003)      (0.002)     
##   color: ^4                                               0.029***     0.012***   
##                                                          (0.003)      (0.002)     
##   color: ^5                                              -0.016***    -0.003*     
##                                                          (0.003)      (0.001)     
##   color: ^6                                              -0.023***     0.001      
##                                                          (0.002)      (0.001)     
##   clarity: .L                                                          0.907***   
##                                                                       (0.003)     
##   clarity: .Q                                                         -0.240***   
##                                                                       (0.003)     
##   clarity: .C                                                          0.131***   
##                                                                       (0.003)     
##   clarity: ^4                                                         -0.063***   
##                                                                       (0.002)     
##   clarity: ^5                                                          0.026***   
##                                                                       (0.002)     
##   clarity: ^6                                                         -0.002      
##                                                                       (0.002)     
##   clarity: ^7                                                          0.032***   
##                                                                       (0.001)     
## ----------------------------------------------------------------------------------
##   R-squared            0.9236       0.9349       0.9391      0.9514       0.9839  
##   adj. R-squared       0.9236       0.9349       0.9391      0.9514       0.9839  
##   sigma                0.2805       0.2588       0.2504      0.2237       0.1286  
##   F               652012.0628  387489.3661  138654.5235  87959.4667  173791.0840  
##   p                    0.0000       0.0000       0.0000      0.0000       0.0000  
##   Log-likelihood   -7962.4993   -3631.3185   -1837.4158   4235.2405   34091.2720  
##   Deviance          4242.8307    3613.3602    3380.8373   2699.2124     892.2143  
##   AIC              15930.9985    7270.6370    3690.8316  -8442.4809  -68140.5440  
##   BIC              15957.6854    7306.2195    3761.9966  -8317.9421  -67953.7358  
##   N                53940        53940        53940       53940        53940       
## ==================================================================================

When running this code, we can see that we get some very nice R square values. We are accounting for almost all of the variance in price using carat, cut, color and clarity. If we want to know whether the price of a diamond is reasonable, we might now use this model.

Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.


Model Problems

Research: What could be some probelsm when using this model? What else should we think about when using this model? (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)

Response: This model only includes information about the prices for diamonds at a specific point in time (this data is from 2008). Therefore, the diamond market is probably quite different than it is now in 2016 (~8 year gap). Inflation and other things might have had a big effect on our model. Also, this predictive model might not be the most beneficial for the future based on the following information: “From 2018 onward, as existing mines get depleted and no major new deposits come online, supply is expected to decline, falling behind expected demand growth that will be driven by China, India and the US. Over the next 10-year period, supply and demand are expected to grow at a compound annual rate of 2.0% and 5.1%, respectively.” In other words, demand will surpass supply in the near future and the price of diamonds will rise exponentially! Therefore, we have to constantly update our model with the most recent set of diamond information and not use data that is 8 years old.

The above quoted information was obtained from: http://www.bain.com/publications/articles/global-diamond-report-2013.aspx.

Solomon’s Response: “To start, this data is from 2008. When I fitted models using this data and predicted the price of the diamonds that I found in the market, I kept getting predictions that were way too low. After some additional digging, I found that global diamonds were poor. It turns out that prices plummeted in 2008 due to the global financial crisis and since then, prices at least for wholesale polished diamonds, have grown at about 6% per year, compound annual rate… And finally, after looking at the data on price scope, I realized that diamond prices grew unevenly across different karat sizes since 2008. Therefore, the initially estimated model could not simply be adjusted by inflation.”

Resources:


A Bigger, Better Data Set

Notes: Solomon was able to get a more up-to-date diamond price data from diamondse.info. This data set was 10x the size of the 2008 data set and features diamonds from all over the world certified by an array of authorities besides just the Gemological Institute of America (GIA).

#install.packages('bitops')
#install.packages('RCurl')
library('bitops')
library('RCurl')

#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))
load("BigDiamonds.rda")

The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data

Building a Model Using the Big Diamonds Data Set

Notes: We first subset the data to only look at GIA certified diamonds and only look at diamonds under $10,000 because these are the types of diamonds sold at most retailers. By trimming our data, our model will be less likely thrown off by outliers and the higher variants at the high-end of price and carat.

#Only want GIA Certified diamonds that are under $10,000
diamondsbig <- diamondsbig[diamondsbig$price < 10000 & 
                             diamondsbig$cert == "GIA",]

#Sample from this large dataset
set.seed(20022012)
diamondsBigSample <- diamondsbig[sample(1:nrow(diamondsbig), 10000), ]

#Create 5 different models with variables of interest.
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamondsBigSample)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5, sdigits=4)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamondsBigSample)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamondsBigSample)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamondsBigSample)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamondsBigSample)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamondsBigSample)
## 
## ==============================================================================
##                       m1          m2          m3          m4          m5      
## ------------------------------------------------------------------------------
##   (Intercept)      2.655***    1.317***    0.895***    0.457***   -0.380***   
##                   (0.018)     (0.074)     (0.073)     (0.060)     (0.058)     
##   I(carat^(1/3))   5.854***    8.260***    8.686***    8.164***    8.439***   
##                   (0.021)     (0.130)     (0.127)     (0.101)     (0.071)     
##   carat                       -1.063***   -1.236***   -0.794***   -0.802***   
##                               (0.057)     (0.055)     (0.044)     (0.031)     
##   cut: V.Good                              0.140***    0.099***    0.069***   
##                                           (0.010)     (0.008)     (0.005)     
##   cut: Ideal                               0.230***    0.194***    0.134***   
##                                           (0.009)     (0.007)     (0.005)     
##   color: K/L                                           0.147***    0.136***   
##                                                       (0.021)     (0.015)     
##   color: J/L                                           0.337***    0.319***   
##                                                       (0.019)     (0.013)     
##   color: I/L                                           0.475***    0.476***   
##                                                       (0.019)     (0.013)     
##   color: H/L                                           0.588***    0.606***   
##                                                       (0.019)     (0.013)     
##   color: G/L                                           0.657***    0.677***   
##                                                       (0.019)     (0.013)     
##   color: F/L                                           0.716***    0.738***   
##                                                       (0.019)     (0.013)     
##   color: E/L                                           0.748***    0.759***   
##                                                       (0.019)     (0.013)     
##   color: D/L                                           0.839***    0.837***   
##                                                       (0.019)     (0.013)     
##   clarity: I1                                                      0.107**    
##                                                                   (0.041)     
##   clarity: SI2                                                     0.437***   
##                                                                   (0.040)     
##   clarity: SI1                                                     0.555***   
##                                                                   (0.040)     
##   clarity: VS2                                                     0.669***   
##                                                                   (0.040)     
##   clarity: VS1                                                     0.729***   
##                                                                   (0.040)     
##   clarity: VVS2                                                    0.769***   
##                                                                   (0.040)     
##   clarity: VVS1                                                    0.830***   
##                                                                   (0.040)     
##   clarity: IF                                                      0.894***   
##                                                                   (0.041)     
## ------------------------------------------------------------------------------
##   R-squared           0.8855      0.8893      0.8968      0.9360      0.9688  
##   adj. R-squared      0.8855      0.8893      0.8968      0.9359      0.9688  
##   sigma               0.2952      0.2901      0.2802      0.2208      0.1541  
##   F               77219.7220  40133.1337  21701.7573  12153.7745  15495.1216  
##   p                   0.0000      0.0000      0.0000      0.0000      0.0000  
##   Log-likelihood  -1983.8138  -1811.8507  -1461.4758    920.9049   4517.3303  
##   Deviance          870.1127    840.6670    783.7190    486.4322    236.7707  
##   AIC              3973.6275   3631.7014   2934.9515  -1813.8099  -8990.6607  
##   BIC              3995.2556   3660.5387   2978.2076  -1712.8791  -8832.0552  
##   N                9990        9990        9990        9990        9990       
## ==============================================================================

The output is similar to that of the smaller diamonds dataset, although the R-squared values are a bit weaker than our previous model.


Predictions

Let’s use our newly created model to make predictions. Remember that we need to exponentiate our result because we are obtaining the prediction of price in (natural) logarithmic form.


Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.

Example Diamond from BlueNile: Carat 1.00, Cut = Very Good, color = I, clarity = VS1, Listed Price = $5,601

#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
                         color = "I", clarity="VS1")
(modelEstimate = predict(m5, newdata = thisDiamond,
                        interval="prediction", level = .95) )
##        fit      lwr      upr
## 1 8.531352 8.228985 8.833718
exp(modelEstimate)
##        fit      lwr      upr
## 1 5071.296 3748.027 6861.754

Response: Our results show that the BlueNile diamond is a bit pricier ($5,601) than the expected value under the full model (Predicted Price = $5,071). However, it is by no means outside of the 95% CI. According to Solomon, “BlueNile has a better reputation than diamonSE.info. While this model might give you a sense of whether your diamond is a ripoff, against diamondSE.info diamonds, it’s not clear that diamondSE.info should be regarded as the universal source of truth about whether the price of the diamond is reasonable. Nonetheless, to have the expected price and an SE with a 95% confidence interval is a lot more information that we had about the price we should be willing to pay for a diamond before the start of this exercise.”


Side Note: The prediction interval here may be slightly conservative, as the model errors are heteroskedastic over carat (and hence price) even after our log and cube-root transformations.

#See the output of the following code.
dat = data.frame(m5$model, m5$residuals)

#SD of residualds
with(dat, sd(m5.residuals))
## [1] 0.1539583
with(subset(dat, carat > .9 & carat < 1.1), sd(m5.residuals))
## [1] 0.1203055
dat$resid <- as.numeric(dat$m5.residuals)
ggplot(aes(y = resid, x = round(carat, 2)), data = dat) +
  geom_line(stat = "summary", fun.y = sd) +
  ggtitle("Model #5: Residuals vs. Carat Size")

How could we do better? If we care most about diamonds with carat weights between 0.50 and 1.50, we might restrict the data we use to fit our model to diamonds that are that size - we have enough data. Notice how we get larger residuals for carat sizes greater than 2.0, which has an affect on the size of our confidence interval.


Final Thoughts

Notes: Facebook data scientist Saung Masin brings up an important point: “even though we can predict the price of a diamond based on a function of the 4 c’s in diamonds, one thing that we shouldn’t conclude with this exercise is that where you buy your diamond is irrelevant. You will almost surely pay more for the same diamond at Tiffany’s compared to Costco. However, you can use a model like this to get a sense of whether you are overpaying. One last note is that data and models are never infallible and you can still get taken even equipped with this kind of analysis. There is no substitute for establishing a personal connection and a lasting business relationship with a jeweler you can trust.”

Resource:

“Tiffany vs. Costco: Which Diamond Ring Is Better” http://www.bloomberg.com/news/articles/2013-05-06/tiffany-vs-dot-costco-which-diamond-ring-is-better.


Final Quote:

** “An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question” - John Tukey **