This report presents an exploratory data analysis of the well-known Diamonds dataset, which records the prices and other attributes of almost 54,000 diamonds, and a linear regression model built on those observations to predict diamond price.
The dataset contains 10 variables and 53,940 observations of round-cut diamonds. The variables are described below:
Price : Price in US dollars ($326-$18,823)
Carat : Weight of the diamond (0.2-5.01)
Cut : Quality of the cut (Fair, Good, Very Good, Premium, Ideal)
Color : Diamond colour, from J (worst) to D (best)
Clarity : A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
X : Length in mm (0-10.74)
Y : Width in mm (0-58.9)
Z : Depth in mm (0-31.8)
Depth : Total depth percentage = 100 * z / mean(x, y) = 100 * 2 * z / (x + y) (43-79)
Table : Width of top of diamond relative to widest point (43-95)
You can visit this website for extra information about diamonds:
http://www.diamondse.info/
‘Price’, ‘Carat’, ‘X’, ‘Y’, ‘Z’, ‘Depth’ and ‘Table’ are quantitative variables, while ‘Cut’, ‘Color’ and ‘Clarity’ are qualitative. I’ve introduced a new variable, ‘Volume’, which is derived from the dimensions ‘X’, ‘Y’ and ‘Z’.
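As a rough illustration of the derived variables, the sketch below computes a volume and a total depth percentage for three made-up stones. It assumes that volume is simply the product x * y * z (the report only says it is derived from the dimensions), and the data frame `stones` and its values are hypothetical.

```r
# Three hypothetical stones; x, y and z are in mm, as in the variable list.
stones <- data.frame(
  x = c(3.95, 4.05, 6.00),
  y = c(3.98, 4.07, 6.10),
  z = c(2.43, 2.31, 3.70)
)

# Assumption: volume is the bounding-box product of the three dimensions.
stones$volume <- with(stones, x * y * z)

# Total depth percentage: z relative to the mean of x and y, times 100.
stones$depth <- with(stones, 100 * 2 * z / (x + y))

print(round(stones, 2))
```

With the real data, the same two lines could be applied to the full data frame instead of `stones`.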
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z volume
## Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 4.720 1st Qu.: 2.910 1st Qu.: 65.14
## Median : 5.710 Median : 3.530 Median : 114.81
## Mean : 5.735 Mean : 3.539 Mean : 129.85
## 3rd Qu.: 6.540 3rd Qu.: 4.040 3rd Qu.: 170.84
## Max. :58.900 Max. :31.800 Max. :3840.60
##
Most diamonds are priced between 0 and 2,000 dollars. This is plausible, since many buyers shop within this budget. There is a high peak near 1,000 dollars, where about 5,250 samples fall, and a smaller peak around 2,000 dollars. These peaks probably reflect standard price points, as price tags such as 999, 1,499 or 1,999 are common in the market. The frequency changes little across the higher price ranges.
The quality of the cut reveals the true optical properties of a diamond, in particular its high refractive index and color dispersion. Most diamonds in the dataset have an ‘Ideal’ cut, but other cuts are also well represented. This is because other factors such as ‘color’ and ‘carat’ also affect the look and brilliance of a diamond and may call for a cut other than ideal; put simply, offering the best possible diamond at a given price involves trade-offs between properties.
Most ideal-cut diamonds are priced below 3,000 dollars. The number of samples falls as the price rises: the summary above puts the overall median price at 2,401 dollars and the third quartile at 5,324 dollars. The ‘Fair’ cut isn’t very popular but still exists in the market, with 2.98% of the total.
| | Fair | Good | Very Good | Premium | Ideal |
|---|---|---|---|---|---|
| Minimum Value | 337.00 | 327.00 | 336.00 | 326.00 | 326.00 |
| First Quartile | 2050.25 | 1145.00 | 912.00 | 1046.00 | 878.00 |
| Median | 3282.00 | 3050.50 | 2648.00 | 3185.00 | 1810.00 |
| Third Quartile | 5205.50 | 5028.00 | 5372.75 | 6296.00 | 4678.50 |
| Maximum Value | 18574.00 | 18788.00 | 18818.00 | 18823.00 | 18806.00 |
| Mean | 4358.76 | 3928.86 | 3981.76 | 4584.26 | 3457.54 |
| Percentage of overall | 2.98 | 9.10 | 22.40 | 25.57 | 39.95 |
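A table of this shape can be produced with a few lines of base R. The following is only a sketch on a tiny made-up sample (the data frame `toy` and its prices are hypothetical), not the code used for this report; the same pattern would apply to the color and clarity tables.

```r
# Hypothetical mini-sample standing in for the full dataset.
toy <- data.frame(
  cut   = factor(c("Fair", "Ideal", "Ideal", "Good", "Ideal", "Fair", "Good", "Ideal"),
                 levels = c("Fair", "Good", "Ideal")),
  price = c(2000, 900, 1800, 3000, 4700, 5200, 1100, 600)
)

# One column per cut, one row per statistic.
price_by_cut <- sapply(split(toy$price, toy$cut), function(p) {
  c(min    = min(p),
    q1     = unname(quantile(p, 0.25)),
    median = median(p),
    q3     = unname(quantile(p, 0.75)),
    max    = max(p),
    mean   = mean(p))
})

# Share of each cut in the sample, in percent.
share <- 100 * as.vector(prop.table(table(toy$cut)))

print(round(rbind(price_by_cut, share = share), 2))
```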
The most valuable diamonds are classified as colorless. Nowadays, however, diamonds are available in various colors, including yellow, red, green and rare colors such as black. Color is graded by letter from ‘D’ to ‘J’: ‘D’ (best) is colorless, and the stones become progressively less colorless towards ‘J’ (worst).
Most diamonds fall in the ‘G’, ‘E’ and ‘F’ color grades, while ‘I’ and ‘J’ are the least common. The unusual peaks lie at the standard price points that exist in the market.
| | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|
| Minimum Value | 357.00 | 326.00 | 342.00 | 354.00 | 337.00 | 334.00 | 335.00 |
| First Quartile | 911.00 | 882.00 | 982.00 | 931.00 | 984.00 | 1120.50 | 1860.50 |
| Median | 1838.00 | 1739.00 | 2343.50 | 2242.00 | 3460.00 | 3730.00 | 4234.00 |
| Third Quartile | 4213.50 | 4003.00 | 4868.25 | 6048.00 | 5980.25 | 7201.75 | 7695.00 |
| Maximum Value | 18693.00 | 18731.00 | 18791.00 | 18818.00 | 18803.00 | 18823.00 | 18710.00 |
| Mean | 3169.95 | 3076.75 | 3724.89 | 3999.14 | 4486.67 | 5091.87 | 5323.82 |
| Percentage of overall | 12.56 | 18.16 | 17.69 | 20.93 | 15.39 | 10.05 | 5.21 |
Clarity is one of the important factors that strongly affect the cost of a diamond. It is defined by the number, location and type of inclusions the diamond contains. Inclusions can be microscopic cracks, mineral deposits or external markings. Clarity is graded from ‘I1’ (worst) to ‘IF’ (best).
It seems that ‘SI2’ keeps pace with increasing price, while the frequency of ‘VS1’ and ‘VS2’ falls as prices increase. ‘I1’ is not a popular choice. The count of high-quality ‘IF’ diamonds is also low, which might be due to their high prices. Most ‘IF’ diamonds are priced below 2,000 dollars.
| | I1 | SI2 | SI1 | VS2 | VS1 | VVS2 | VVS1 | IF |
|---|---|---|---|---|---|---|---|---|
| Minimum Value | 345.00 | 326.00 | 326.00 | 334.00 | 327.00 | 336.00 | 336.00 | 369.00 |
| First Quartile | 2080.00 | 2264.00 | 1089.00 | 900.00 | 876.00 | 794.25 | 816.00 | 895.00 |
| Median | 3344.00 | 4072.00 | 2822.00 | 2054.00 | 2005.00 | 1311.00 | 1093.00 | 1080.00 |
| Third Quartile | 5161.00 | 5777.25 | 5250.00 | 6023.75 | 6023.00 | 3638.25 | 2379.00 | 2388.50 |
| Maximum Value | 18531.00 | 18804.00 | 18818.00 | 18823.00 | 18795.00 | 18768.00 | 18777.00 | 18806.00 |
| Mean | 3924.17 | 5063.03 | 3996.00 | 3924.99 | 3839.46 | 3283.74 | 2523.11 | 2864.84 |
| Percentage of overall | 1.37 | 17.04 | 24.22 | 22.73 | 15.15 | 9.39 | 6.78 | 3.32 |
I chose to show the ‘carat’ and ‘volume’ plots side by side because the two variables are closely related. Carat is the weight of the diamond, and since the density of diamond is essentially constant, a given volume corresponds to a fixed weight. That’s why both show approximately the same relationship to price or to any other variable.
We can see price points accumulating at some standard values, a pattern seen throughout the analysis. Price grows roughly exponentially with both carat and volume. So, after taking the log of price and the cube root (power 1/3) of volume and carat, the plots show an approximately linear relationship.
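To see why this pair of transformations straightens the relationship, here is a small sketch on noise-free, made-up prices. It assumes prices follow exactly the form fitted by model m1 further down in this report, using its reported coefficients (2.820 and 5.560); the carat grid is hypothetical.

```r
# Hypothetical carat values spanning a typical range.
carat <- seq(0.2, 3.0, by = 0.1)

# Noise-free prices generated from the m1 coefficients (an idealised assumption).
price <- exp(2.820 + 5.560 * carat^(1/3))

# Linear correlation before and after transforming both variables.
raw_cor         <- cor(carat, price)
transformed_cor <- cor(carat^(1/3), log(price))

print(c(raw = raw_cor, transformed = transformed_cor))
```

On real data the transformed correlation would be below 1 because of noise, but the same comparison can be used to judge whether a transformation helps.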
The total depth percentage of a diamond is largely responsible for the amount of brilliance it displays. Diamonds with an ‘ideal’ to ‘very fine’ cut have a total depth percentage between 59% and 63%.
The scatter plot also shows that most points accumulate below a price of 5,000 dollars, and that the total depth percentage varies more as the price increases, since many factors affect diamonds at higher prices. The difference between cuts is clearly visible: fair-cut diamonds mostly lie either below 58% or above 68%, while ideal-cut diamonds have a depth between 60% and 63%.
Table width is the width of the flat top facet of the diamond relative to its widest point. Its size is less critical to the beauty of the diamond than variation in the crown angle and especially the pavilion angle. Most table widths lie between 53% and 63%. The plot is colored according to cut: ideal cuts range from 53% to 57%, very good and premium cuts from 58% to 63%, and fair cuts are mostly above 66%.
The dimensions x, y and z also show a strong correlation with price, but since volume is derived from these dimensions, it is strongly correlated as well. So, we can consider volume more suitable for building the linear regression model.
Depth (0.01) and table (0.13) show only a very weak correlation with diamond price.
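Correlation coefficients like these can be read off a correlation matrix. The sketch below shows the idea on six made-up stones; the data frame `toy` and all its values are hypothetical, and volume is again assumed to be x * y * z.

```r
# Six hypothetical stones with dimensions in mm and prices in dollars.
toy <- data.frame(
  x     = c(4.0, 4.5, 5.0, 5.7, 6.5, 7.0),
  z     = c(2.45, 2.80, 3.10, 3.53, 4.00, 4.35),
  price = c(400, 700, 1500, 2400, 5300, 9000)
)
toy$y      <- toy$x + 0.03
toy$volume <- with(toy, x * y * z)            # assumed definition of volume
toy$depth  <- with(toy, 100 * 2 * z / (x + y))

# Pearson correlation of every variable with price.
cor_with_price <- cor(toy)[, "price"]

print(round(sort(cor_with_price, decreasing = TRUE), 2))
```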
As carat and volume are strongly correlated with each other, both have a strong relationship with price. I’ve used transformed variables to build the linear model: the log of price and the cube root (power 1/3) of carat.
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = trainingData)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + volume, data = trainingData)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + volume + color,
## data = trainingData)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + volume + color +
## clarity, data = trainingData)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + volume + color +
## clarity + cut, data = trainingData)
##
## ============================================================================================
## m1 m2 m3 m4 m5
## --------------------------------------------------------------------------------------------
## (Intercept) 2.820*** 2.151*** 2.040*** 1.506*** 1.431***
## (0.007) (0.015) (0.014) (0.009) (0.009)
## I(carat^(1/3)) 5.560*** 6.695*** 6.699*** 7.342*** 7.405***
## (0.008) (0.025) (0.022) (0.014) (0.014)
## volume -0.003*** -0.002*** -0.003*** -0.003***
## (0.000) (0.000) (0.000) (0.000)
## color: .L -0.402*** -0.471*** -0.469***
## (0.004) (0.003) (0.003)
## color: .Q -0.144*** -0.109*** -0.108***
## (0.004) (0.002) (0.002)
## color: .C -0.002 -0.013*** -0.011***
## (0.004) (0.002) (0.002)
## color: ^4 0.030*** 0.017*** 0.017***
## (0.003) (0.002) (0.002)
## color: ^5 -0.020*** -0.005** -0.005*
## (0.003) (0.002) (0.002)
## color: ^6 -0.025*** 0.001 0.003
## (0.003) (0.002) (0.002)
## clarity: .L 0.945*** 0.912***
## (0.005) (0.004)
## clarity: .Q -0.287*** -0.273***
## (0.004) (0.004)
## clarity: .C 0.167*** 0.155***
## (0.004) (0.004)
## clarity: ^4 -0.073*** -0.067***
## (0.003) (0.003)
## clarity: ^5 0.033*** 0.030***
## (0.002) (0.002)
## clarity: ^6 -0.005* -0.002
## (0.002) (0.002)
## clarity: ^7 0.029*** 0.026***
## (0.002) (0.002)
## cut: .L 0.119***
## (0.003)
## cut: .Q -0.031***
## (0.003)
## cut: .C 0.020***
## (0.002)
## cut: ^4 -0.000
## (0.002)
## --------------------------------------------------------------------------------------------
## R-squared 0.924 0.927 0.942 0.977 0.978
## adj. R-squared 0.924 0.927 0.942 0.977 0.978
## sigma 0.280 0.273 0.245 0.154 0.151
## F 521237.448 275771.923 87244.122 121281.122 100570.757
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -6367.630 -5238.617 -487.061 19397.986 20436.192
## Deviance 3393.892 3220.865 2584.226 1028.188 979.884
## AIC 12741.260 10485.235 994.122 -38761.972 -40830.384
## BIC 12767.277 10519.924 1080.847 -38614.540 -40648.262
## N 43152 43152 43152 43152 43152
## ============================================================================================
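The calls above fit nested models on `trainingData`, whose 43,152 rows are 80% of the 53,940 observations. The sketch below imitates that workflow on simulated data, since the splitting code is not shown in this report. The data frame `sim`, its coefficients, the seed and the random 80/20 split are all assumptions, and volume, clarity and cut are left out to keep it short.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Simulated stand-in for the diamonds data (hypothetical values).
n <- 500
sim <- data.frame(
  carat = runif(n, 0.2, 2.5),
  color = factor(sample(c("D", "G", "J"), n, replace = TRUE),
                 levels = c("J", "G", "D"), ordered = TRUE)
)
sim$price <- exp(2.8 + 5.6 * sim$carat^(1/3) +
                 0.15 * as.integer(sim$color) + rnorm(n, sd = 0.1))

# Random 80/20 split into training and test rows.
train_idx    <- sample(seq_len(n), size = 0.8 * n)
trainingData <- sim[train_idx, ]
testData     <- sim[-train_idx, ]

# Nested models: start with carat, then add color.
m1 <- lm(log(price) ~ I(carat^(1/3)), data = trainingData)
m2 <- update(m1, . ~ . + color)

print(c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared))
```

Because `color` is an ordered factor, `lm` uses polynomial contrasts for it by default, which is where coefficient names such as `.L` and `.Q` in the table above come from.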
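The model appears to perform well: the correlation between the actual and predicted values is 0.9893. Strictly speaking, this is a correlation coefficient rather than an accuracy score.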
## actuals predicteds
## actuals 1.0000000 0.9893356
## predicteds 0.9893356 1.0000000
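A matrix like this one can be obtained by predicting on held-out rows and correlating the predictions with the actual values. The sketch below does this on simulated data; the data frame `sim`, the every-fifth-row hold-out and the choice to compare on the log scale are assumptions, since the report does not show how `actuals` and `predicteds` were built.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical data: log price is linear in the cube root of carat, plus noise.
n <- 300
sim <- data.frame(carat = runif(n, 0.2, 2.5))
sim$price <- exp(2.8 + 5.6 * sim$carat^(1/3) + rnorm(n, sd = 0.15))

# Hold out every fifth row (20%) and fit on the rest.
test_rows <- seq_len(n) %% 5 == 0
fit <- lm(log(price) ~ I(carat^(1/3)), data = sim[!test_rows, ])

# Predictions are on the log scale, so compare them with log(price).
results <- data.frame(
  actuals    = log(sim$price[test_rows]),
  predicteds = predict(fit, newdata = sim[test_rows, ])
)
print(cor(results))

# Correlation ignores systematic bias, so an error measure is a useful companion.
rmse <- sqrt(mean((results$actuals - results$predicteds)^2))
print(rmse)
```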
This dataset contains values only for round-cut diamonds from the year 2008, so we can’t treat this as a general-purpose model for predicting diamond prices. Nowadays, many other variables also go into determining the price of a diamond.
If we take a look at the Diamond Information website (resource 1, linked above), we find many additional properties such as pavilion, culet, fluorescence, polish and symmetry. Diamonds also come in various other shapes, such as Radiant, Emerald, Asscher, Heart, Pear and many more. By taking all these variables and properties into account, a more appropriate linear model can be built that gives more suitable predictions. This would help people get an idea of the price and buy diamonds in an efficient and effective manner.