College: IIT Kanpur
Email Id: spamazing3097@gmail.com
Name: Saurabh Pandey
Project: Final Report of the Project: Diamond's Price Analysis

INTRODUCTION

As we all know, diamonds have long been a favourite in ornaments. They are widely used to make different kinds of jewellery and decorative items. Because they are extremely hard, they are also used in other ways, for example in cutting instruments.

OVERVIEW

Our study looks for correlations between different factors and the price of a diamond. A diamond's price is decided by factors such as quality of cut, color, clarity, weight (in carats), depth, table, etc. So we have to work with the data on all of these properties together with the given price.

DATA

For this study, we collected data on almost 54,000 diamonds with their various properties. We took the data from Kaggle (https://www.kaggle.com/shivam2503/diamonds), a very well-known site for datasets and data analytics activities.

CONTENT

This dataset contains several columns describing the variables on which the price of a diamond depends. The columns are as follows:

1- Price: price in US dollars ($326–$18,823)
2- Carat: weight of the diamond (0.2–5.01)
3- Cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
4- Color: diamond colour, from J (worst) to D (best)
5- Clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
6- x: length in mm (0–10.74)
7- y: width in mm (0–58.9)
8- z: depth in mm (0–31.8)
9- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
10- table: width of top of diamond relative to widest point (43–95)
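The depth formula in the column list can be checked with a small worked example. This is only a sketch: the measurements below are hypothetical values, not rows taken from the dataset, and the factor of 100 is an assumption implied by the stated 43–79 range (the formula as written gives a ratio).

```r
# Hypothetical measurements in mm (illustrative only, not from the dataset).
x <- 3.95  # length
y <- 3.98  # width
z <- 2.43  # depth

# Total depth percentage: z divided by the mean of x and y, times 100.
depth_pct <- 100 * z / mean(c(x, y))

# The equivalent closed form from the column description.
depth_pct_alt <- 100 * 2 * z / (x + y)

print(round(depth_pct, 2))
print(all.equal(depth_pct, depth_pct_alt))
```

Both expressions give the same number because mean(x, y) is (x + y) / 2.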

Some Insights and Correlations

Let’s get started

 setwd("~/Documents/internship related")
 diamonds.csv <- read.csv("diamonds.csv")
 View(diamonds.csv)

For the dimensions of the dataset

dim(diamonds.csv)
## [1] 53940    11

So there are 53940 rows and 11 columns.

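First we convert the categorical columns (cut, color and clarity) into numeric codes so that the analysis becomes easier. The codes follow the alphabetical order of the labels: for cut, Fair = 1, Good = 2, Ideal = 3, Premium = 4, Very Good = 5; for color, D = 1 through J = 7; for clarity, I1 = 1, IF = 2, SI1 = 3, SI2 = 4, VS1 = 5, VS2 = 6, VVS1 = 7, VVS2 = 8. For cut and clarity these codes therefore do not follow the quality order.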

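 # factor() first, so this also works when read.csv() returns character columns (the default since R 4.0)
 diamonds.csv[, 3:5] <- sapply(diamonds.csv[, 3:5], function(col) as.numeric(factor(col)))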
 View(diamonds.csv)
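To see how the numeric codes come about, here is a minimal sketch on a toy vector of cut labels (the vector and the variable names are illustrative, not part of the analysis). It contrasts the default alphabetical coding with an explicit level order taken from the column description; the second variant is an alternative, not what the report uses.

```r
# Toy vector of cut labels (illustrative, not the full column).
cut_labels <- c("Ideal", "Fair", "Very Good", "Premium", "Good")

# Default factor(): levels are sorted alphabetically,
# so Fair = 1, Good = 2, Ideal = 3, Premium = 4, Very Good = 5.
alphabetical_codes <- as.numeric(factor(cut_labels))

# Explicit levels: codes follow the quality order from the column description.
quality_order <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
ordered_codes <- as.numeric(factor(cut_labels, levels = quality_order))

print(alphabetical_codes)
print(ordered_codes)
```

With explicit levels, a higher code would always mean a better cut, which makes correlations and regression coefficients easier to read.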

Now let’s find some summary statistics of this dataset

library(psych)
describe(diamonds.csv)[ ,1:9]
##         vars     n     mean       sd   median  trimmed      mad   min
## X          1 53940 26970.50 15571.28 26970.50 26970.50 19992.86   1.0
## carat      2 53940     0.80     0.47     0.70     0.73     0.47   0.2
## cut        3 53940     3.55     1.03     3.00     3.60     1.48   1.0
## color      4 53940     3.59     1.70     4.00     3.55     1.48   1.0
## clarity    5 53940     4.84     1.72     5.00     4.75     1.48   1.0
## depth      6 53940    61.75     1.43    61.80    61.78     1.04  43.0
## table      7 53940    57.46     2.23    57.00    57.32     1.48  43.0
## price      8 53940  3932.80  3989.44  2401.00  3158.99  2475.94 326.0
## x          9 53940     5.73     1.12     5.70     5.66     1.38   0.0
## y         10 53940     5.73     1.14     5.71     5.66     1.36   0.0
## z         11 53940     3.54     0.71     3.53     3.49     0.85   0.0
##              max
## X       53940.00
## carat       5.01
## cut         5.00
## color       7.00
## clarity     8.00
## depth      79.00
## table      95.00
## price   18823.00
## x          10.74
## y          58.90
## z          31.80

So these are the summary statistics of all variables.

For summary statistics of prices of diamonds

describe(diamonds.csv$price)
##    vars     n   mean      sd median trimmed     mad min   max range skew
## X1    1 53940 3932.8 3989.44   2401 3158.99 2475.94 326 18823 18497 1.62
##    kurtosis    se
## X1     2.18 17.18

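So the mean price is 3932.8 and the median price is 2401. The prices range from 326 to 18823, with a standard deviation of 3989.44. The mean lying well above the median is consistent with the positive skew (1.62) reported above.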

For summary statistics of carats of diamonds

describe(diamonds.csv$carat)
##    vars     n mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 53940  0.8 0.47    0.7    0.73 0.47 0.2 5.01  4.81 1.12     1.26
##    se
## X1  0

So the mean weight is 0.8 carats and the median weight is 0.7 carats. The weight ranges from 0.2 to 5.01 carats, with a standard deviation of 0.47.

For summary statistics of depths of diamonds

describe(diamonds.csv$depth)
##    vars     n  mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 53940 61.75 1.43   61.8   61.78 1.04  43  79    36 -0.08     5.74
##      se
## X1 0.01

Now drawing one way contingency tables for categorical variables

For cuts of the diamonds

table(diamonds.csv$cut)
## 
##     1     2     3     4     5 
##  1610  4906 21551 13791 12082

For color of the diamonds

table(diamonds.csv$color)
## 
##     1     2     3     4     5     6     7 
##  6775  9797  9542 11292  8304  5422  2808

For clarity of the diamonds

table(diamonds.csv$clarity)
## 
##     1     2     3     4     5     6     7     8 
##   741  1790 13065  9194  8171 12258  3655  5066

For carats of the diamonds

table(diamonds.csv$carat)
## 
##  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29  0.3 0.31 0.32 0.33 0.34 
##   12    9    5  293  254  212  253  233  198  130 2604 2249 1840 1189  910 
## 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 
##  667  572  394  670  398 1299 1382  706  488  212  110  178   99   63   45 
##  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59  0.6 0.61 0.62 0.63 0.64 
## 1258 1127  817  709  625  496  492  430  310  282  228  204  135  102   80 
## 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 
##   65   48   48   25   26 1981 1294  764  492  322  249  251  251  187  155 
##  0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89  0.9 0.91 0.92 0.93 0.94 
##  284  200  140  131   64   62   34   31   23   21 1485  570  226  142   59 
## 0.95 0.96 0.97 0.98 0.99    1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 
##   65  103   59   31   23 1558 2242  883  523  475  361  373  342  246  287 
##  1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19  1.2 1.21 1.22 1.23 1.24 
##  278  308  251  246  207  149  172  110  123  126  645  473  300  279  236 
## 1.25 1.26 1.27 1.28 1.29  1.3 1.31 1.32 1.33 1.34 1.35 1.36 1.37 1.38 1.39 
##  187  146  134  106  101  122  133   89   87   68   77   50   46   26   36 
##  1.4 1.41 1.42 1.43 1.44 1.45 1.46 1.47 1.48 1.49  1.5 1.51 1.52 1.53 1.54 
##   50   40   25   19   18   15   18   21    7   11  793  807  381  220  174 
## 1.55 1.56 1.57 1.58 1.59  1.6 1.61 1.62 1.63 1.64 1.65 1.66 1.67 1.68 1.69 
##  124  109  106   89   89   95   64   61   50   43   32   30   25   19   24 
##  1.7 1.71 1.72 1.73 1.74 1.75 1.76 1.77 1.78 1.79  1.8 1.81 1.82 1.83 1.84 
##  215  119   57   52   40   50   28   17   12   15   21    9   13   18    4 
## 1.85 1.86 1.87 1.88 1.89  1.9 1.91 1.92 1.93 1.94 1.95 1.96 1.97 1.98 1.99 
##    3    9    7    4    4    7   12    2    6    3    3    4    4    5    3 
##    2 2.01 2.02 2.03 2.04 2.05 2.06 2.07 2.08 2.09  2.1 2.11 2.12 2.13 2.14 
##  265  440  177  122   86   67   60   50   41   45   52   43   25   21   48 
## 2.15 2.16 2.17 2.18 2.19  2.2 2.21 2.22 2.23 2.24 2.25 2.26 2.27 2.28 2.29 
##   22   25   18   31   22   32   23   27   13   16   18   15   12   20   17 
##  2.3 2.31 2.32 2.33 2.34 2.35 2.36 2.37 2.38 2.39  2.4 2.41 2.42 2.43 2.44 
##   21   13   16    9    5    7    8    6    8    7   13    5    8    6    4 
## 2.45 2.46 2.47 2.48 2.49  2.5 2.51 2.52 2.53 2.54 2.55 2.56 2.57 2.58 2.59 
##    4    3    3    9    3   17   17    9    8    9    3    3    3    3    1 
##  2.6 2.61 2.63 2.64 2.65 2.66 2.67 2.68  2.7 2.71 2.72 2.74 2.75 2.77  2.8 
##    3    3    3    1    1    3    1    2    1    1    3    3    2    1    2 
##    3 3.01 3.02 3.04 3.05 3.11 3.22 3.24  3.4  3.5 3.51 3.65 3.67    4 4.01 
##    8   14    1    2    1    1    1    1    1    1    1    1    1    1    2 
## 4.13  4.5 5.01 
##    1    1    1
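Because carat is a continuous variable, a table of every distinct value is long. One possible way to summarise it is to bin the weights first. The sketch below uses a toy vector and arbitrarily chosen break points (both are assumptions for illustration); it calls base::cut() explicitly so that it does not clash with a column named cut.

```r
# Toy carat values (illustrative, not from the dataset).
carat_demo <- c(0.23, 0.31, 0.70, 0.71, 1.01, 1.50, 2.01)

# Arbitrary bin edges chosen for illustration; intervals are right-closed.
breaks <- c(0, 0.5, 1, 1.5, 2, Inf)

# Assign each weight to a bin, then count the diamonds per bin.
carat_bins <- base::cut(carat_demo, breaks = breaks)
binned <- table(carat_bins)

print(binned)
```

Applied to the real column, the same call would replace the long table with a handful of counts.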

Now drawing two way contingency tables for categorical variables in the dataset

The table between cut and color

cut_color <-xtabs(~cut+color,data = diamonds.csv)
ftable(cut_color)
##     color    1    2    3    4    5    6    7
## cut                                         
## 1          163  224  312  314  303  175  119
## 2          662  933  909  871  702  522  307
## 3         2834 3903 3826 4884 3115 2093  896
## 4         1603 2337 2331 2924 2360 1428  808
## 5         1513 2400 2164 2299 1824 1204  678

The table between cut and clarity

cut_clarity <-xtabs(~cut+clarity,data = diamonds.csv)
ftable(cut_clarity)
##     clarity    1    2    3    4    5    6    7    8
## cut                                                
## 1            210    9  408  466  170  261   17   69
## 2             96   71 1560 1081  648  978  186  286
## 3            146 1212 4282 2598 3589 5071 2047 2606
## 4            205  230 3575 2949 1989 3357  616  870
## 5             84  268 3240 2100 1775 2591  789 1235

The table between color and clarity

color_clarity <-xtabs(~color+clarity,data = diamonds.csv)
ftable(color_clarity)
##       clarity    1    2    3    4    5    6    7    8
## color                                                
## 1               42   73 2083 1370  705 1697  252  553
## 2              102  158 2426 1713 1281 2470  656  991
## 3              143  385 2131 1609 1364 2201  734  975
## 4              150  681 1976 1548 2148 2347  999 1443
## 5              162  299 2275 1563 1169 1643  585  608
## 6               92  143 1424  912  962 1169  355  365
## 7               50   51  750  479  542  731   74  131
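Raw counts in a two-way table are hard to compare across rows of very different sizes. As a rough sketch, row proportions can be computed with prop.table(); the toy data frame below is made up for illustration and only mimics the coded cut and color columns.

```r
# Toy data with coded cut and color columns (illustrative values).
toy <- data.frame(
  cut   = c(1, 1, 2, 2, 2, 3),
  color = c(1, 2, 1, 1, 2, 2)
)

# Two-way table of counts, as in the report.
counts <- xtabs(~ cut + color, data = toy)

# Share of each color within each cut (every row sums to 1).
row_share <- prop.table(counts, margin = 1)

print(round(row_share, 2))
```

Using margin = 2 instead would give the share of each cut within each color.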

Now drawing box plots of each variable

For carat

boxplot(diamonds.csv$carat)

For cut

boxplot(diamonds.csv$cut)

For color

boxplot(diamonds.csv$color)

For clarity

boxplot(diamonds.csv$clarity)

For table

boxplot(diamonds.csv$table)

For depth

boxplot(diamonds.csv$depth)

For price

boxplot(diamonds.csv$price)
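Repeating near-identical calls makes it easy to plot the wrong column, so one option is to loop over the column names. The following is a minimal sketch on a toy data frame (the values are invented); pdf(NULL) is used only so the example runs without opening a plot window and can be dropped in an interactive session.

```r
# Toy data frame standing in for the numeric columns (illustrative values).
toy <- data.frame(
  carat = c(0.3, 0.7, 1.0, 1.5, 2.0),
  depth = c(61.5, 62.0, 60.8, 63.1, 59.9),
  price = c(500, 2400, 5000, 9800, 15000)
)

pdf(NULL)  # null device: nothing is written to disk

plotted <- character(0)
for (v in names(toy)) {
  # One box plot per column, titled with the column name.
  boxplot(toy[[v]], main = paste("Box plot of", v), ylab = v)
  plotted <- c(plotted, v)
}

invisible(dev.off())
print(plotted)
```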

Now drawing histograms for suitable data fields

For distribution of diamond’s weights in carats

library(lattice)
histogram(~carat, data = diamonds.csv,
main = "Distribution of diamond's weight in Carat", xlab="Carats", col='red' )

For distribution of cuts of diamonds

histogram(~cut, data = diamonds.csv,
main = "Distribution of Cuts of diamonds ", xlab="Cuts", col='blue' )

For distribution of colors of diamonds

histogram(~color, data = diamonds.csv,
main = "Distribution of Colors of diamonds ", xlab="Colors", col='blue' )

For distribution of clarity of diamonds

histogram(~clarity, data = diamonds.csv,
main = "Distribution of Clarity of diamonds ", xlab="Clarity", col='green' )

For distribution of depths of diamonds

histogram(~depth, data = diamonds.csv,
main = "Distribution of depths of diamonds ", xlab="Depths", col='skyblue' )

For distribution of tables of diamonds

histogram(~table, data = diamonds.csv,
main = "Distribution of Tables of diamonds ", xlab="Tables", col='blue' )

For distribution of Prices of diamonds

histogram(~price, data = diamonds.csv,
main = "Distribution of Prices of diamonds ", xlab="Prices", col='red' )

Now some plots to understand data better

For distribution of prices of diamonds with cut

boxplot(price~cut,data = diamonds.csv,main="Distribution of prices of diamonds with cut",xlab="Cut",ylab="Prices")

For distribution of prices of diamonds with weight of diamonds

plot(price~carat,data = diamonds.csv,main="Distribution of prices of diamonds with weight",xlab="Carat",ylab="Prices")

For distribution of prices of diamonds with color of diamonds

boxplot(price~color,data = diamonds.csv,main="Distribution of prices of diamonds with color",xlab="Color",ylab="Prices")

For distribution of prices of diamonds with clarity of diamonds

boxplot(price~clarity,data = diamonds.csv,main="Distribution of prices of diamonds with clarity",xlab="Clarity",ylab="Prices")

For distribution of prices of diamonds with depth of diamonds

plot(price~depth,data = diamonds.csv,main="Distribution of prices of diamonds with depth",xlab="depth",ylab="Prices")

For distribution of prices of diamonds with table of diamonds

boxplot(price~table,data = diamonds.csv,main="Distribution of prices of diamonds with table",xlab="table",ylab="Prices")

For a correlation matrix

cor(diamonds.csv)
##                   X       carat           cut         color     clarity
## X        1.00000000 -0.37798348 -0.0233272316 -0.0950979466  0.12513599
## carat   -0.37798348  1.00000000  0.0171237362  0.2914367543 -0.21429037
## cut     -0.02332723  0.01712374  1.0000000000  0.0003042479  0.02823537
## color   -0.09509795  0.29143675  0.0003042479  1.0000000000 -0.02779550
## clarity  0.12513599 -0.21429037  0.0282353656 -0.0277954960  1.00000000
## depth   -0.03480023  0.02822431 -0.1942485626  0.0472792348 -0.05308011
## table   -0.10083032  0.18161755  0.1503270263  0.0264652011 -0.08822266
## price   -0.30687318  0.92159130  0.0398602909  0.1725109282 -0.07153497
## x       -0.40544047  0.97509423  0.0223419276  0.2702866854 -0.22572144
## y       -0.39584267  0.95172220  0.0275720250  0.2635844027 -0.21761579
## z       -0.39920829  0.95338738  0.0020373568  0.2682268757 -0.22426307
##               depth       table       price           x           y
## X       -0.03480023 -0.10083032 -0.30687318 -0.40544047 -0.39584267
## carat    0.02822431  0.18161755  0.92159130  0.97509423  0.95172220
## cut     -0.19424856  0.15032703  0.03986029  0.02234193  0.02757203
## color    0.04727923  0.02646520  0.17251093  0.27028669  0.26358440
## clarity -0.05308011 -0.08822266 -0.07153497 -0.22572144 -0.21761579
## depth    1.00000000 -0.29577852 -0.01064740 -0.02528925 -0.02934067
## table   -0.29577852  1.00000000  0.12713390  0.19534428  0.18376015
## price   -0.01064740  0.12713390  1.00000000  0.88443516  0.86542090
## x       -0.02528925  0.19534428  0.88443516  1.00000000  0.97470148
## y       -0.02934067  0.18376015  0.86542090  0.97470148  1.00000000
## z        0.09492388  0.15092869  0.86124944  0.97077180  0.95200572
##                    z
## X       -0.399208287
## carat    0.953387381
## cut      0.002037357
## color    0.268226876
## clarity -0.224263069
## depth    0.094923882
## table    0.150928692
## price    0.861249444
## x        0.970771799
## y        0.952005716
## z        1.000000000

Now for visualising this let’s create a corrgram

library(corrplot)
## corrplot 0.84 loaded
corrplot.mixed(corr=cor(diamonds.csv, use="complete.obs"),lower = "shade" ,
                                                upper="pie", tl.pos="d")
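When the main interest is price, it can help to pull out only the price row of the correlation matrix and rank it. The sketch below does this on simulated data; the data-generating numbers are arbitrary assumptions and do not reproduce the diamonds dataset.

```r
# Simulated stand-in data (arbitrary parameters, for illustration only).
set.seed(1)
n <- 200
carat <- runif(n, 0.2, 2)
depth <- rnorm(n, 61.7, 1.4)
price <- 4000 * carat + rnorm(n, 0, 500)
toy <- data.frame(carat, depth, price)

# Full correlation matrix, then the price row without price itself.
corr <- cor(toy)
price_corr <- corr["price", colnames(corr) != "price"]

# Rank the other variables by absolute correlation with price.
ranked <- price_corr[order(abs(price_corr), decreasing = TRUE)]

print(round(ranked, 3))
```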

Checking some hypotheses

Now let’s assume the following null hypothesis: the price of a diamond is not correlated with its weight. Let’s run a test to check this hypothesis.

attach(diamonds.csv)
cor.test(price,carat)
## 
##  Pearson's product-moment correlation
## 
## data:  price and carat
## t = 551.41, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9203098 0.9228530
## sample estimates:
##       cor 
## 0.9215913

Since the p-value is less than .05, we reject this null hypothesis.
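As an aside, the same test can be run without attach() by using the formula interface of cor.test(), which keeps the column lookup explicit. This is a sketch on simulated data (the numbers are arbitrary assumptions), showing how the estimate, p-value and confidence interval can be read from the result.

```r
# Simulated stand-in data (arbitrary parameters, for illustration only).
set.seed(42)
toy <- data.frame(carat = runif(100, 0.2, 2))
toy$price <- 4000 * toy$carat + rnorm(100, 0, 800)

# Formula interface: the columns are looked up in `data`, no attach() needed.
res <- cor.test(~ price + carat, data = toy)

print(res$estimate)  # sample correlation
print(res$p.value)   # p-value of the test that the true correlation is 0
print(res$conf.int)  # 95 percent confidence interval
```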

Now let’s assume the following null hypothesis: the price of a diamond is not correlated with its quality of cut. Let’s run a test to check this hypothesis.

cor.test(price,cut)
## 
##  Pearson's product-moment correlation
## 
## data:  price and cut
## t = 9.2647, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03143180 0.04828312
## sample estimates:
##        cor 
## 0.03986029

Since the p-value is less than .05, we reject this null hypothesis, although the correlation itself (0.04) is very weak.

Let’s assume the following null hypothesis: the price of a diamond is not correlated with its color. Let’s run a test to check this hypothesis.

cor.test(price,color)
## 
##  Pearson's product-moment correlation
## 
## data:  price and color
## t = 40.675, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1643111 0.1806869
## sample estimates:
##       cor 
## 0.1725109

Since the p-value is less than .05, we reject this null hypothesis.

Let’s assume the following null hypothesis: the price of a diamond is not correlated with its clarity. Let’s run a test to check this hypothesis.

cor.test(price,clarity)
## 
##  Pearson's product-moment correlation
## 
## data:  price and clarity
## t = -16.656, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07992579 -0.06313402
## sample estimates:
##         cor 
## -0.07153497

Since the p-value is less than .05, we reject this null hypothesis.

Let’s assume the following null hypothesis: the price of a diamond is not correlated with its depth. Let’s run a test to check this hypothesis.

cor.test(price,depth)
## 
##  Pearson's product-moment correlation
## 
## data:  price and depth
## t = -2.473, df = 53938, p-value = 0.0134
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.019084756 -0.002208537
## sample estimates:
##        cor 
## -0.0106474

Since the p-value (0.0134) is less than .05, we reject this null hypothesis at the 5% level, although the correlation itself (-0.011) is negligible in magnitude.
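With almost 54,000 observations, even a tiny correlation can give a small p-value. The worked sketch below recomputes the t statistic and p-value of the depth test from the reported correlation and degrees of freedom, using the standard formula t = r * sqrt(df) / sqrt(1 - r^2); the variable names are illustrative.

```r
# Values copied from the cor.test output for price and depth.
r  <- -0.0106474  # sample correlation
df <- 53938       # degrees of freedom (n - 2)

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

# Share of the variance in price associated with depth alone.
r_squared <- r^2

print(round(t_stat, 3))
print(round(p_value, 4))
print(r_squared)
```

The squared correlation is on the order of 0.0001, so the result is statistically significant but of little practical weight.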

Let’s assume the following null hypothesis: the price of a diamond is not correlated with its table. Let’s run a test to check this hypothesis.

cor.test(price,table)
## 
##  Pearson's product-moment correlation
## 
## data:  price and table
## t = 29.768, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1188223 0.1354277
## sample estimates:
##       cor 
## 0.1271339

Since the p-value is less than .05, we reject this null hypothesis.

Let’s assume the following null hypothesis: the price of a diamond is not correlated with its length (x). Let’s run a test to check this hypothesis.

cor.test(price,x)
## 
##  Pearson's product-moment correlation
## 
## data:  price and x
## t = 440.16, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8825835 0.8862594
## sample estimates:
##       cor 
## 0.8844352

Since the p-value is less than .05, we reject this null hypothesis.

Let’s assume the following null hypothesis: the price of a diamond is not correlated with its width (y). Let’s run a test to check this hypothesis.

cor.test(price,y)
## 
##  Pearson's product-moment correlation
## 
## data:  price and y
## t = 401.14, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8632867 0.8675241
## sample estimates:
##       cor 
## 0.8654209

Since the p-value is less than .05, we reject this null hypothesis.

Let’s assume the following null hypothesis: the price of a diamond is not correlated with its depth in mm (z). Let’s run a test to check this hypothesis.

cor.test(price,z)
## 
##  Pearson's product-moment correlation
## 
## data:  price and z
## t = 393.6, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8590541 0.8634131
## sample estimates:
##       cor 
## 0.8612494

Since the p-value is less than .05, we reject this null hypothesis.

Since every variable takes more than two distinct values, a two-sample t-test is not applicable here.

Now the regression model

In order to find the effects of the various variables on price, we hypothesize the following model

Price = E + Carat*E(0) + Cut*E(1) + Color*E(2) + Clarity*E(3) + Depth*E(4) + Table*E(5) + x-Length*E(6) + y-Length*E(7) + z-Length*E(8) + E(9)

where E(0), E(1), etc. are the coefficients of the predictor variables, E is the intercept and E(9) is the error term.

So let’s run this model

 model <- price~carat+cut+color+clarity+depth+table+x+y+z
 reg <-lm(model,data = diamonds.csv)
 summary(reg)
## 
## Call:
## lm(formula = model, data = diamonds.csv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23527.6   -647.5   -149.0    426.3  12677.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15902.949    412.484  38.554   <2e-16 ***
## carat       10978.275     57.590 190.630   <2e-16 ***
## cut            70.691      5.812  12.163   <2e-16 ***
## color        -266.452      3.590 -74.214   <2e-16 ***
## clarity       287.847      3.488  82.535   <2e-16 ***
## depth        -154.298      5.043 -30.596   <2e-16 ***
## table         -93.316      2.808 -33.238   <2e-16 ***
## x           -1184.925     39.036 -30.355   <2e-16 ***
## y              47.269     23.068   2.049   0.0405 *  
## z              -1.688     40.038  -0.042   0.9664    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1353 on 53930 degrees of freedom
## Multiple R-squared:  0.8851, Adjusted R-squared:  0.8851 
## F-statistic: 4.615e+04 on 9 and 53930 DF,  p-value: < 2.2e-16

We regressed the price of diamonds on weight, quality of cut, color, clarity, depth, table and the three dimensions x, y and z, and obtained the coefficient estimates and their p-values shown above.
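The coefficient table can also be read programmatically rather than by eye. The sketch below fits a small model on simulated data (the data and the two-predictor formula are assumptions for illustration, not the report's model) and extracts the estimates, the predictors with p-values below .05, and R-squared.

```r
# Simulated stand-in data (arbitrary parameters, for illustration only).
set.seed(7)
n <- 300
toy <- data.frame(
  carat = runif(n, 0.2, 2),
  depth = rnorm(n, 61.7, 1.4)
)
toy$price <- 1000 + 5000 * toy$carat - 50 * toy$depth + rnorm(n, 0, 300)

# Fit a linear model and pull out the coefficient table as a matrix.
fit <- lm(price ~ carat + depth, data = toy)
coef_table <- coef(summary(fit))

# Names of the terms whose p-value is below the 5% level.
significant <- rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]

print(round(coef_table, 3))
print(significant)
print(summary(fit)$r.squared)
```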

Result

As we can see from the linear regression, the p-values of carat, cut, color, clarity, depth, table, x and y are less than .05, so these are significant predictor variables, whereas z is not significant (p = 0.9664). The effect of weight on price is much larger than that of the other variables: holding the others fixed, the price increases by about 10978.28 per one-carat increase. Some factors decrease the price when they increase, namely color (-266.45 per code step), depth (-154.30), table (-93.32) and x (-1184.93 per mm). Among these, x has the largest negative coefficient and table the smallest. Clarity gives a decent increase in price (287.85 per code step), though far less than weight. Cut (70.69) and y (47.27) have only a minute effect on the price. Note that each coefficient is expressed per unit of its own variable, so the magnitudes are not directly comparable across variables measured on different scales.
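One rough way to put the coefficients on a common footing is to multiply each by the standard deviation of its variable, giving the change in price per one-standard-deviation change. The sketch below uses the estimates from the regression output and the rounded sd values from the describe() output; it is only an approximate comparison, since carat, x, y and z are strongly correlated with one another and the individual coefficients should be read with caution.

```r
# Coefficient estimates copied from the regression summary.
coef_est <- c(carat = 10978.275, cut = 70.691, color = -266.452,
              clarity = 287.847, depth = -154.298, table = -93.316,
              x = -1184.925, y = 47.269, z = -1.688)

# Standard deviations copied from the describe() output (rounded there).
sds <- c(carat = 0.47, cut = 1.03, color = 1.70, clarity = 1.72,
         depth = 1.43, table = 2.23, x = 1.12, y = 1.14, z = 0.71)

# Change in price per one-standard-deviation change in each predictor.
per_sd <- coef_est * sds[names(coef_est)]

# Rank the predictors by the size of that change.
ranked <- per_sd[order(abs(per_sd), decreasing = TRUE)]

print(round(ranked, 1))
```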

Conclusion

So we investigated the effect of a diamond’s qualities such as cut quality, weight, clarity, etc. on its price and developed a linear model describing how the price depends on these factors.

Reference

The dataset was downloaded from https://www.kaggle.com/shivam2503/diamonds, and all the analysis was performed under the guidance of Prof. Sameer Mathur.

Thanks