This report contains an Exploratory Data Analysis of the well-known Diamonds dataset, which includes the prices and other attributes of almost 54,000 diamonds, and a linear regression model built on these observations to predict diamond price.

This dataset contains 10 variables and 53,940 observations of round-cut diamonds. The variables are described below:

Price : Price in US dollars ($326-$18,823)
Carat : Weight of the diamond (0.2-5.01)
Cut : Quality of the cut (Fair, Good, Very Good, Premium, Ideal)
Color : Diamond colour, from J (worst) to D (best)
Clarity : A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
X : Length in mm (0-10.74)
Y : Width in mm (0-58.9)
Z : Depth in mm (0-31.8)
Depth : Total depth percentage = z / mean(x, y) = 2 * z / (x + y), expressed as a percentage (43-79)
Table : Width of top of diamond relative to widest point (43-95)
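
The depth formula in the list above can be sketched as a small function. This is only an illustration: it assumes the ratio is multiplied by 100 to obtain a percentage (which matches the 43-79 range), and the input values are made up rather than taken from the dataset.

```r
# Total depth percentage from the three dimensions (all in mm).
# Assumption: the ratio 2 * z / (x + y) is scaled by 100 to give a percentage.
depth_pct <- function(x, y, z) {
  100 * 2 * z / (x + y)  # equivalent to 100 * z / mean(c(x, y))
}

# Hypothetical stone, not a row from the dataset.
example_depth <- depth_pct(x = 4.00, y = 4.00, z = 2.46)
print(example_depth)  # roughly 61.5
```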

You can visit this website for extra information about diamonds:
http://www.diamondse.info/

Statistical Properties:

‘Price’, ‘Carat’, ‘X’, ‘Y’, ‘Z’, ‘Depth’ and ‘Table’ are quantitative variables, while ‘Cut’, ‘Color’ and ‘Clarity’ are qualitative. I’ve introduced a new variable, ‘Volume’, which is derived from the dimensions ‘X’, ‘Y’ and ‘Z’.

##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z              volume       
##  Min.   : 0.000   Min.   : 0.000   Min.   :   0.00  
##  1st Qu.: 4.720   1st Qu.: 2.910   1st Qu.:  65.14  
##  Median : 5.710   Median : 3.530   Median : 114.81  
##  Mean   : 5.735   Mean   : 3.539   Mean   : 129.85  
##  3rd Qu.: 6.540   3rd Qu.: 4.040   3rd Qu.: 170.84  
##  Max.   :58.900   Max.   :31.800   Max.   :3840.60  
## 
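
The report does not show how ‘Volume’ is computed. A minimal sketch, assuming it is simply the product x * y * z (a bounding-box volume in cubic millimetres); the toy rows are hypothetical. Under that assumption, a zero in any dimension gives a zero volume, which would be consistent with the minimum of 0.00 in the summary above.

```r
# Toy rows standing in for the diamonds data (values are hypothetical).
toy <- data.frame(
  x = c(3.95, 4.05, 0.00),
  y = c(3.98, 4.07, 6.62),
  z = c(2.43, 2.31, 0.00)
)

# Assumed definition: volume as the product of the three dimensions (mm^3).
toy$volume <- toy$x * toy$y * toy$z

# Rows with a zero dimension end up with volume 0 and may deserve a closer look.
toy$suspect <- toy$volume == 0

print(toy)
print(summary(toy$volume))
```

Flagging zero-volume rows rather than dropping them keeps the decision about how to treat them explicit.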

The Price Frequency:

Most of the diamonds fall in the 0-2000 dollar range. This is a reasonable range, since most people prefer to buy diamonds within this budget. We can observe a high peak near 1000 dollars, where about 5250 diamond samples are present, and a smaller peak around 2000 dollars. This is probably due to standard price points, as price tags of 999, 1499 or 1999 are common in the market. The frequency does not change much across the higher price range.
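
The counts behind a price histogram can be obtained without plotting. The sketch below uses synthetic, right-skewed prices (hypothetical, not the real column) and an assumed bin width of 500 dollars; the report does not state the bin width that was actually used.

```r
set.seed(42)

# Synthetic right-skewed prices, clamped to the dataset's price range (326-18823).
price <- pmin(pmax(round(rlnorm(5000, meanlog = 7.8, sdlog = 1)), 326), 18823)

# Assumed bin width of 500 dollars; plot = FALSE returns the counts only.
h <- hist(price, breaks = seq(0, 19000, by = 500), plot = FALSE)

# Midpoint of the most populated bin.
peak_bin <- h$mids[which.max(h$counts)]

print(head(data.frame(bin_mid = h$mids, count = h$counts)))
print(peak_bin)
```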

The Diamond Price Frequency according to cut, color and clarity:

By Cut:

The quality of the cut reveals the true optical properties of a diamond, in particular its high refractive index and color dispersion. Most diamonds in the dataset are of ‘Ideal’ cut, but the other cuts are also well represented. This is because other factors such as ‘color’ and ‘carat’ also alter the look or brilliance of a diamond and may call for a cut other than ideal; put simply, there is a trade-off among properties in order to offer the best possible diamond at a given price.
Most ideal-cut diamonds are priced below 3000. As we move towards higher prices, the number of samples falls. Well over half of the diamond samples are priced below 7000 (the overall third quartile is 5324). The ‘Fair’ cut isn’t very popular but still exists in the market, at 2.98% of the total.

                            Fair      Good Very Good   Premium     Ideal
Minimum Value             337.00    327.00    336.00    326.00    326.00
First Quartile           2050.25   1145.00    912.00   1046.00    878.00
Median                   3282.00   3050.50   2648.00   3185.00   1810.00
Third Quartile           5205.50   5028.00   5372.75   6296.00   4678.50
Maximum Value           18574.00  18788.00  18818.00  18823.00  18806.00
Mean                     4358.76   3928.86   3981.76   4584.26   3457.54
Percentage of overall       2.98      9.10     22.40     25.57     39.95

By Color:

The most valuable diamonds are classified as colorless. Nowadays, however, diamonds are available in various colors including yellow, red, green and rare colors such as black. Color is graded by letter from ‘D’ to ‘J’: ‘D’ (best) is colorless, and the stones become progressively less colorless towards ‘J’ (worst).

The ‘G’, ‘E’ and ‘F’ colors are the most common in the dataset, followed by ‘H’, while ‘I’ and ‘J’ are the least common. There are unusual peaks at the standard price values that exist in the market.

                               D         E         F         G         H         I         J
Minimum Value             357.00    326.00    342.00    354.00    337.00    334.00    335.00
First Quartile            911.00    882.00    982.00    931.00    984.00   1120.50   1860.50
Median                   1838.00   1739.00   2343.50   2242.00   3460.00   3730.00   4234.00
Third Quartile           4213.50   4003.00   4868.25   6048.00   5980.25   7201.75   7695.00
Maximum Value           18693.00  18731.00  18791.00  18818.00  18803.00  18823.00  18710.00
Mean                     3169.95   3076.75   3724.89   3999.14   4486.67   5091.87   5323.82
Percentage of overall      12.56     18.16     17.69     20.93     15.39     10.05      5.21

By Clarity:

Clarity is one of the important factors that strongly affect the cost of a diamond. It is defined by the number, location and type of inclusions the diamond contains. Inclusions can be microscopic cracks, mineral deposits or external markings. Clarity ranges from ‘I1’ (worst) to ‘IF’ (best).

It seems that ‘SI2’ keeps pace as price increases, while the frequency of ‘VS1’ and ‘VS2’ falls as prices rise. The best grade, ‘IF’, is not a common choice (3.32% of the total), which might be due to its high price. The worst grade, ‘I1’, is rarer still (1.37%). Most ‘IF’ diamonds in the sample are priced below the 2000 dollar range (their median price is 1080).

                              I1       SI2       SI1       VS2       VS1      VVS2      VVS1        IF
Minimum Value             345.00    326.00    326.00    334.00    327.00    336.00    336.00    369.00
First Quartile           2080.00   2264.00   1089.00    900.00    876.00    794.25    816.00    895.00
Median                   3344.00   4072.00   2822.00   2054.00   2005.00   1311.00   1093.00   1080.00
Third Quartile           5161.00   5777.25   5250.00   6023.75   6023.00   3638.25   2379.00   2388.50
Maximum Value           18531.00  18804.00  18818.00  18823.00  18795.00  18768.00  18777.00  18806.00
Mean                     3924.17   5063.03   3996.00   3924.99   3839.46   3283.74   2523.11   2864.84
Percentage of overall       1.37     17.04     24.22     22.73     15.15      9.39      6.78      3.32
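
Per-group tables like the ones for cut, color and clarity can be produced with base R. The sketch below is one possible way, not necessarily the code behind this report; the toy data frame and the helper `price_summary` are hypothetical.

```r
# Toy data standing in for the diamonds data frame (values are hypothetical).
toy <- data.frame(
  cut   = factor(c("Fair", "Fair", "Ideal", "Ideal", "Ideal", "Good"),
                 levels = c("Fair", "Good", "Ideal")),
  price = c(400, 3000, 350, 900, 5000, 1200)
)

# Summary of price within one group: min, quartiles, max and mean.
price_summary <- function(p) {
  c(min    = min(p),
    q1     = unname(quantile(p, 0.25)),
    median = median(p),
    q3     = unname(quantile(p, 0.75)),
    max    = max(p),
    mean   = mean(p))
}

# One column per cut level, one row per statistic.
by_cut <- sapply(split(toy$price, toy$cut), price_summary)

# Share of each group in the whole sample, in percent.
share <- round(100 * prop.table(table(toy$cut)), 2)

print(round(by_cut, 2))
print(share)
```

Replacing `toy$cut` with a color or clarity column would give the other two tables in the same layout.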

Effect of Volume and Carat on Price

I chose to show the ‘carat’ and ‘volume’ plots side by side because the two are closely related. Carat is the weight of the diamond, and since the density of diamond is essentially constant, a given volume corresponds to a fixed weight. That’s why both have approximately the same relationship to price or to any other variable.

We can see an accumulation of price points at some standard values. This pattern is seen throughout the analysis. Price grows roughly exponentially with both carat and volume. So, by taking the log of price and the cube root (power 1/3) of volume and carat, the plots show an approximately linear relationship between the variables.
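
The transformation can be sketched on synthetic data. Note that the data below is generated so that log(price) is linear in carat^(1/3), so it only illustrates the mechanics of the transform and does not prove anything about the real dataset. One common rationale for the cube root is that weight scales with volume, and volume scales with the cube of a linear dimension. The coefficients 2.8 and 5.6 are illustrative values.

```r
set.seed(1)

# Synthetic data: log(price) is built to be linear in carat^(1/3) plus noise.
carat <- runif(500, min = 0.2, max = 3)
price <- exp(2.8 + 5.6 * carat^(1/3) + rnorm(500, sd = 0.25))

# Correlation on the raw scale versus the transformed scale.
cor_raw         <- cor(carat, price)
cor_transformed <- cor(carat^(1/3), log(price))

print(c(raw = cor_raw, transformed = cor_transformed))

# To inspect the shapes visually, uncomment:
# plot(carat, price)
# plot(carat^(1/3), log(price))
```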

Depth and its effect on price and cut:

The total depth percentage of a diamond is largely responsible for the amount of brilliance the diamond will display. ‘Ideal’ to ‘very fine’ cut diamonds have a total depth percentage between 59% and 63%.

The scatter plot also shows that most points are accumulated below a price of 5000, but as we move higher in price, the total depth percentage shows more variance. This is partly because many factors affect diamonds at higher prices. The difference between cuts can be clearly seen: fair diamonds are mostly either below 58% or above 68%, while ideal diamonds have a depth between 60% and 63%.

Table and its effect on the Diamond Price according to Cut:

Table width is the width of the diamond’s top facet, expressed as a percentage of its widest point (the average girdle diameter). Its size is less critical to the beauty of the diamond than variation in the crown angle and especially the pavilion angle. Mostly, the table width is between 53% and 63%. The plot is colored according to diamond cut. Ideal cuts range from 53% to 57%, very good and premium cuts range from 58% to 63%, and fair cuts are mostly above 66%.

Correlation among the variables:

Linear Regression Model:

Carat and volume are strongly correlated with each other, and both have a strong relationship with price. I’ve used transformed variables to build the linear model: the log of price and the cube root (power 1/3) of carat.

## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = trainingData)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + volume, data = trainingData)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + volume + color, 
##     data = trainingData)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + volume + color + 
##     clarity, data = trainingData)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + volume + color + 
##     clarity + cut, data = trainingData)
## 
## ============================================================================================
##                        m1             m2             m3            m4             m5        
## --------------------------------------------------------------------------------------------
##   (Intercept)          2.820***       2.151***      2.040***       1.506***       1.431***  
##                       (0.007)        (0.015)       (0.014)        (0.009)        (0.009)    
##   I(carat^(1/3))       5.560***       6.695***      6.699***       7.342***       7.405***  
##                       (0.008)        (0.025)       (0.022)        (0.014)        (0.014)    
##   volume                             -0.003***     -0.002***      -0.003***      -0.003***  
##                                      (0.000)       (0.000)        (0.000)        (0.000)    
##   color: .L                                        -0.402***      -0.471***      -0.469***  
##                                                    (0.004)        (0.003)        (0.003)    
##   color: .Q                                        -0.144***      -0.109***      -0.108***  
##                                                    (0.004)        (0.002)        (0.002)    
##   color: .C                                        -0.002         -0.013***      -0.011***  
##                                                    (0.004)        (0.002)        (0.002)    
##   color: ^4                                         0.030***       0.017***       0.017***  
##                                                    (0.003)        (0.002)        (0.002)    
##   color: ^5                                        -0.020***      -0.005**       -0.005*    
##                                                    (0.003)        (0.002)        (0.002)    
##   color: ^6                                        -0.025***       0.001          0.003     
##                                                    (0.003)        (0.002)        (0.002)    
##   clarity: .L                                                      0.945***       0.912***  
##                                                                   (0.005)        (0.004)    
##   clarity: .Q                                                     -0.287***      -0.273***  
##                                                                   (0.004)        (0.004)    
##   clarity: .C                                                      0.167***       0.155***  
##                                                                   (0.004)        (0.004)    
##   clarity: ^4                                                     -0.073***      -0.067***  
##                                                                   (0.003)        (0.003)    
##   clarity: ^5                                                      0.033***       0.030***  
##                                                                   (0.002)        (0.002)    
##   clarity: ^6                                                     -0.005*        -0.002     
##                                                                   (0.002)        (0.002)    
##   clarity: ^7                                                      0.029***       0.026***  
##                                                                   (0.002)        (0.002)    
##   cut: .L                                                                         0.119***  
##                                                                                  (0.003)    
##   cut: .Q                                                                        -0.031***  
##                                                                                  (0.003)    
##   cut: .C                                                                         0.020***  
##                                                                                  (0.002)    
##   cut: ^4                                                                        -0.000     
##                                                                                  (0.002)    
## --------------------------------------------------------------------------------------------
##   R-squared            0.924          0.927         0.942          0.977          0.978     
##   adj. R-squared       0.924          0.927         0.942          0.977          0.978     
##   sigma                0.280          0.273         0.245          0.154          0.151     
##   F               521237.448     275771.923     87244.122     121281.122     100570.757     
##   p                    0.000          0.000         0.000          0.000          0.000     
##   Log-likelihood   -6367.630      -5238.617      -487.061      19397.986      20436.192     
##   Deviance          3393.892       3220.865      2584.226       1028.188        979.884     
##   AIC              12741.260      10485.235       994.122     -38761.972     -40830.384     
##   BIC              12767.277      10519.924      1080.847     -38614.540     -40648.262     
##   N                43152          43152         43152          43152          43152         
## ============================================================================================
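
A sketch of how nested models of this kind can be fitted and compared. It runs on synthetic data, and the models `m_base` and `m_color` are hypothetical stand-ins, not the report's m1 to m5. The split code is an assumption: N = 43152 is 80% of 53,940, which suggests an 80/20 train/test split, but the report does not show how `trainingData` was created. The `.L`, `.Q`, `.C` and `^4` rows in the table are the polynomial contrasts that R uses by default for ordered factors, which is why `color` is created as an ordered factor here.

```r
set.seed(123)
n <- 1000
color_levels <- c("D", "E", "F", "G", "H", "I", "J")

# Synthetic stand-in for the diamonds data (all values are hypothetical).
toy <- data.frame(
  carat = runif(n, 0.2, 2.5),
  color = factor(sample(color_levels, n, replace = TRUE),
                 levels = color_levels, ordered = TRUE)
)
toy$price <- exp(2.8 + 5.6 * toy$carat^(1/3) -
                 0.05 * as.integer(toy$color) + rnorm(n, sd = 0.2))

# Assumed 80/20 random split into training and test rows.
train_idx    <- sample(seq_len(n), size = 0.8 * n)
trainingData <- toy[train_idx, ]
testData     <- toy[-train_idx, ]

# Nested models: the second adds color to the first.
m_base  <- lm(log(price) ~ I(carat^(1/3)), data = trainingData)
m_color <- update(m_base, . ~ . + color)

# F-test for the added terms, and information criteria for both fits.
print(anova(m_base, m_color))
print(AIC(m_base, m_color))
```

Building each model with `update()` keeps the nesting explicit, so every comparison isolates the contribution of the newly added term.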

Statistical properties and their significance for model m5:
  • R-squared: It tells the proportion of variation in the dependent variable that is explained by the model. The value for m5 is 0.978, so the model explains about 98% of the variance in the response, log(price), using the input variables.
  • Adjusted R-squared: It is helpful when we keep adding variables to the original model. Sometimes a non-significant variable only explains variation that was already explained by the model, and plain R-squared never decreases when a term is added. Adjusted R-squared therefore penalizes the value for the number of terms in the model. For nested models, it’s always better to look at the adjusted R-squared value. In our model it equals R-squared to three decimal places (0.978).
  • Residual standard error [sigma]: It measures the typical size of the residuals, i.e. the spread of the observations around the fitted values. The closer the value is to zero, the better. For our model, the value is 0.151 on the log(price) scale.
  • F-statistic: It tests the overall significance of the model against an intercept-only model. A higher value is stronger evidence that the predictors matter. Our model has 100570.757.
  • AIC and BIC [Akaike’s information criterion & Bayesian information criterion]: Both measure goodness of fit while penalizing model complexity. Lower values are better; m5 has the lowest AIC and BIC of the five models.
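
These statistics can all be read from a fitted `lm` object. The sketch below uses a small synthetic fit (hypothetical data and coefficients) and also recomputes adjusted R-squared by hand from the standard formula, to show how the penalty for the number of predictors enters.

```r
set.seed(7)
n <- 500

# Synthetic data: log-price linear in carat^(1/3) plus noise (hypothetical).
carat     <- runif(n, 0.2, 2.5)
log_price <- 2.8 + 5.6 * carat^(1/3) + rnorm(n, sd = 0.2)

fit <- lm(log_price ~ I(carat^(1/3)))
s   <- summary(fit)

r2     <- s$r.squared
adj_r2 <- s$adj.r.squared
sigma  <- s$sigma                         # residual standard error
f_stat <- unname(s$fstatistic["value"])   # overall F-statistic

# Adjusted R-squared by hand: p is the number of predictors (without intercept).
p          <- length(coef(fit)) - 1
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2, adj_manual = adj_manual,
        sigma = sigma, F = f_stat, AIC = AIC(fit), BIC = BIC(fit)))
```

With a large sample and few predictors the penalty factor (n - 1) / (n - p - 1) is close to 1, which is why R-squared and adjusted R-squared can agree to several decimal places.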

Accuracy score:

The model appears to perform well: the accuracy score, measured as the correlation between the actual and predicted values, is 0.9893.

##              actuals predicteds
## actuals    1.0000000  0.9893356
## predicteds 0.9893356  1.0000000
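
A sketch of how such a correlation matrix can be computed on held-out data. The data, the single-predictor model and the split are hypothetical. It is also an assumption that the comparison is made on the log scale; the report does not say whether the actual and predicted values were compared as log(price) or in dollars.

```r
set.seed(99)
n <- 1000

# Synthetic stand-in for the diamonds data (values are hypothetical).
toy <- data.frame(carat = runif(n, 0.2, 2.5))
toy$price <- exp(2.8 + 5.6 * toy$carat^(1/3) + rnorm(n, sd = 0.2))

# Assumed 80/20 random split.
train_idx    <- sample(seq_len(n), size = 0.8 * n)
trainingData <- toy[train_idx, ]
testData     <- toy[-train_idx, ]

fit <- lm(log(price) ~ I(carat^(1/3)), data = trainingData)

# Predictions are on the log scale because the response is log(price).
pred_log <- predict(fit, newdata = testData)

actuals_preds <- data.frame(actuals    = log(testData$price),
                            predicteds = pred_log)
correlation_accuracy <- cor(actuals_preds)
print(correlation_accuracy)

# Back-transform to dollars for an error in price units.
# This simple exp() ignores retransformation bias.
pred_price <- exp(pred_log)
rmse <- sqrt(mean((testData$price - pred_price)^2))
print(rmse)
```

A high correlation shows that predictions move with the actual values, but it does not measure the size of the errors, so an error metric such as RMSE is a useful complement.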

Scatter Plot of predicted and actual diamond price:

Limitations and Future Scope:

This dataset contains only values for round-cut diamonds for the year 2008. So, we can’t treat this as a general-purpose model for predicting price. Also, nowadays many other variables are involved in defining the price of a diamond.

If we take a look at the Diamond Information website (resource 1), it is clear that many other properties are now considered, such as pavilion, culet, fluorescence, polish and symmetry. Diamonds of various other shapes are also present in the market, such as Radiant, Emerald, Asscher, Heart, Pear and many more. By considering all these variables and properties, a more appropriate linear model can be built which will provide more suitable predictions. This will help people get an idea of the price and decide on and buy diamonds in an efficient and effective manner.

Resources:

  1. http://diamondse.info
  2. http://r-statistics.co/Linear-Regression.html