This classic dataset contains the prices and other attributes of almost 54,000 diamonds. The contents of this dataset are as follows:
1- Price: price in US dollars ($326–$18,823)
2- Carat: weight of the diamond (0.2–5.01)
3- Cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
4- Color: diamond colour, from J (worst) to D (best)
5- Clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
6- x: length in mm (0–10.74)
7- y: width in mm (0–58.9)
8- z: depth in mm (0–31.8)
9- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
10- table: width of top of diamond relative to widest point (43–95)
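To make the depth formula concrete, here is a minimal sketch on made-up measurements. The helper name `depth_pct` and the `toy` data frame are illustrative only, and the factor of 100 is our assumption: it reads "percentage" literally so that the result lands in the 43–79 range quoted above.

```r
# Hypothetical helper: total depth percentage from the three dimensions (in mm).
# Assumption: the ratio is multiplied by 100 to express it as a percentage.
depth_pct <- function(x, y, z) {
  100 * 2 * z / (x + y)   # z divided by the mean of x and y
}

# Toy measurements, not rows from the dataset.
toy <- data.frame(x = c(4.0, 6.0), y = c(4.0, 6.2), z = c(2.4, 3.8))
toy$depth <- depth_pct(toy$x, toy$y, toy$z)
print(toy)
```

A row whose x and y are both 0 would produce NaN here, which is worth remembering given that the listed ranges for x, y and z start at 0.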

We’ll study how these factors affect the price of a diamond. By the end of the analysis we should know which factors play the major role in deciding a diamond’s price. Let’s get started.

# Point R at the folder that holds diamonds.csv (this path is specific to the author's machine)
setwd("C:/Users/SAURAB~1/AppData/Local/Temp/Rar$DIa0.525")
diamonds.csv <- read.csv("diamonds.csv")
View(diamonds.csv)

For the dimensions of the dataset

dim(diamonds.csv)

## [1] 53940    11

So there are 53940 rows and 11 columns.

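Next we convert the character columns of this dataset (cut, color and clarity) into numeric codes so that the analysis becomes easier. Note that the codes follow the alphabetical order of the labels rather than the quality order: for cut, 1 = Fair, 2 = Good, 3 = Ideal, 4 = Premium and 5 = Very Good; for color, 1 = D through 7 = J; for clarity, 1 = I1, 2 = IF, 3 = SI1, 4 = SI2, 5 = VS1, 6 = VS2, 7 = VVS1 and 8 = VVS2.

# factor() first, so the conversion also works when read.csv() returns character columns
diamonds.csv[, 3:5] <- lapply(diamonds.csv[, 3:5], function(col) as.numeric(factor(col)))
View(diamonds.csv)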
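`as.numeric()` on a factor returns codes in alphabetical label order, which does not rank the grades (for cut, Ideal gets 3 and Very Good gets 5). If a quality ranking is wanted, one alternative is an ordered factor built from the orderings listed in the data description. A minimal sketch on toy data follows; `toy` and `cut_levels` are illustrative names, not part of the original analysis.

```r
# Quality order for cut, taken from the data description (worst to best).
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Toy column standing in for the raw cut labels.
toy <- data.frame(cut = c("Ideal", "Fair", "Very Good", "Premium", "Good"),
                  stringsAsFactors = FALSE)

# Default factor levels are sorted alphabetically.
alphabetical <- as.numeric(factor(toy$cut))

# Explicit levels make the codes follow the quality ranking instead.
ordinal <- as.numeric(factor(toy$cut, levels = cut_levels, ordered = TRUE))

print(data.frame(cut = toy$cut, alphabetical, ordinal))
```

With the ordinal coding, a higher number always means a better cut, so correlations and box plots against the code become easier to read.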

Now let’s look at some summary statistics for this dataset

library(psych)
describe(diamonds.csv)[ ,1:9]

##         vars     n     mean       sd   median  trimmed      mad   min
## X          1 53940 26970.50 15571.28 26970.50 26970.50 19992.86   1.0
## carat      2 53940     0.80     0.47     0.70     0.73     0.47   0.2
## cut        3 53940     3.55     1.03     3.00     3.60     1.48   1.0
## color      4 53940     3.59     1.70     4.00     3.55     1.48   1.0
## clarity    5 53940     4.84     1.72     5.00     4.75     1.48   1.0
## depth      6 53940    61.75     1.43    61.80    61.78     1.04  43.0
## table      7 53940    57.46     2.23    57.00    57.32     1.48  43.0
## price      8 53940  3932.80  3989.44  2401.00  3158.99  2475.94 326.0
## x          9 53940     5.73     1.12     5.70     5.66     1.38   0.0
## y         10 53940     5.73     1.14     5.71     5.66     1.36   0.0
## z         11 53940     3.54     0.71     3.53     3.49     0.85   0.0
##              max
## X       53940.00
## carat       5.01
## cut         5.00
## color       7.00
## clarity     8.00
## depth      79.00
## table      95.00
## price   18823.00
## x          10.74
## y          58.90
## z          31.80

So these are the summary statistics of all variables.
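The minimum of 0 reported for x, y and z cannot be a real dimension, so those rows most likely record missing measurements. One way to flag them, sketched here on a toy data frame (`toy` and `zero_dim` are illustrative names, and replacing the zeros with NA is our choice rather than a step in the original analysis):

```r
# Toy rows: the second has no dimensions recorded, the third is missing z only.
toy <- data.frame(x = c(3.95, 0, 4.2),
                  y = c(3.98, 0, 4.3),
                  z = c(2.43, 0, 0))

# TRUE for any row where at least one dimension is zero.
zero_dim <- toy$x == 0 | toy$y == 0 | toy$z == 0
cat("rows with a zero dimension:", sum(zero_dim), "\n")

# Mark all three dimensions of those rows as missing.
toy[zero_dim, c("x", "y", "z")] <- NA
print(toy)
```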

For summary statistics of prices of diamonds

describe(diamonds.csv$price)

##    vars     n   mean      sd median trimmed     mad min   max range skew
## X1    1 53940 3932.8 3989.44   2401 3158.99 2475.94 326 18823 18497 1.62
##    kurtosis    se
## X1     2.18 17.18

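So the mean price is 3932.8 and the median price is 2401. Prices range from 326 to 18823, with a standard deviation of 3989.44. The mean sits well above the median and the skew is 1.62, so the price distribution is right-skewed.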
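To see how a mean above the median goes together with positive skew, here is a small base R sketch on made-up prices. It uses one common moment-based definition of skewness; we are not claiming it is the exact formula behind `describe()`.

```r
# Toy right-skewed prices (illustrative values, not taken from the dataset).
prices <- c(400, 600, 900, 1500, 2400, 4000, 7000, 18000)

n <- length(prices)
centered <- prices - mean(prices)

# Third central moment divided by the cubed standard deviation.
skewness <- (sum(centered^3) / n) / sd(prices)^3

cat("mean:", mean(prices),
    " median:", median(prices),
    " skew:", round(skewness, 2), "\n")
```

A few very expensive stones pull the mean up while leaving the median almost untouched, which is the same pattern the price column shows.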

For summary statistics of carats of diamonds

describe(diamonds.csv$carat)

##    vars     n mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 53940  0.8 0.47    0.7    0.73 0.47 0.2 5.01  4.81 1.12     1.26
##    se
## X1  0

So the mean weight is 0.8 carats and the median weight is 0.7 carats. Weights range from 0.2 to 5.01 carats, with a standard deviation of 0.47.

For summary statistics of depths of diamonds

describe(diamonds.csv$depth)

##    vars     n  mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 53940 61.75 1.43   61.8   61.78 1.04  43  79    36 -0.08     5.74
##      se
## X1 0.01

Now drawing one-way contingency tables for the categorical variables

For cuts of the diamonds

table(diamonds.csv$cut)

## 
##     1     2     3     4     5 
##  1610  4906 21551 13791 12082
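Bare codes are hard to read in a table like this. A small sketch of re-attaching labels and adding percentage shares, on toy codes: `cut_codes` is made up, and the label order assumes the alphabetical coding that `as.numeric()` produces for a factor.

```r
# Toy codes standing in for the numeric cut column.
cut_codes <- c(3, 3, 4, 5, 3, 1, 2, 4, 5, 3)

# Assumed mapping: codes follow alphabetical label order.
cut_labels <- c("Fair", "Good", "Ideal", "Premium", "Very Good")

# Re-attach the labels, then tabulate counts and percentage shares.
cut_named <- factor(cut_codes, levels = 1:5, labels = cut_labels)
counts <- table(cut_named)
shares <- round(100 * prop.table(counts), 1)

print(counts)
print(shares)
```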

For color of the diamonds

table(diamonds.csv$color)

## 
##     1     2     3     4     5     6     7 
##  6775  9797  9542 11292  8304  5422  2808

For clarity of the diamonds

table(diamonds.csv$clarity)

## 
##     1     2     3     4     5     6     7     8 
##   741  1790 13065  9194  8171 12258  3655  5066

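For carats of the diamonds (carat is continuous, so this table is long; note how the counts spike at round weights such as 0.3, 0.5, 0.7, 1 and 1.5)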

table(diamonds.csv$carat)

## 
##  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29  0.3 0.31 0.32 0.33 0.34 
##   12    9    5  293  254  212  253  233  198  130 2604 2249 1840 1189  910 
## 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 
##  667  572  394  670  398 1299 1382  706  488  212  110  178   99   63   45 
##  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59  0.6 0.61 0.62 0.63 0.64 
## 1258 1127  817  709  625  496  492  430  310  282  228  204  135  102   80 
## 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 
##   65   48   48   25   26 1981 1294  764  492  322  249  251  251  187  155 
##  0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89  0.9 0.91 0.92 0.93 0.94 
##  284  200  140  131   64   62   34   31   23   21 1485  570  226  142   59 
## 0.95 0.96 0.97 0.98 0.99    1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 
##   65  103   59   31   23 1558 2242  883  523  475  361  373  342  246  287 
##  1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19  1.2 1.21 1.22 1.23 1.24 
##  278  308  251  246  207  149  172  110  123  126  645  473  300  279  236 
## 1.25 1.26 1.27 1.28 1.29  1.3 1.31 1.32 1.33 1.34 1.35 1.36 1.37 1.38 1.39 
##  187  146  134  106  101  122  133   89   87   68   77   50   46   26   36 
##  1.4 1.41 1.42 1.43 1.44 1.45 1.46 1.47 1.48 1.49  1.5 1.51 1.52 1.53 1.54 
##   50   40   25   19   18   15   18   21    7   11  793  807  381  220  174 
## 1.55 1.56 1.57 1.58 1.59  1.6 1.61 1.62 1.63 1.64 1.65 1.66 1.67 1.68 1.69 
##  124  109  106   89   89   95   64   61   50   43   32   30   25   19   24 
##  1.7 1.71 1.72 1.73 1.74 1.75 1.76 1.77 1.78 1.79  1.8 1.81 1.82 1.83 1.84 
##  215  119   57   52   40   50   28   17   12   15   21    9   13   18    4 
## 1.85 1.86 1.87 1.88 1.89  1.9 1.91 1.92 1.93 1.94 1.95 1.96 1.97 1.98 1.99 
##    3    9    7    4    4    7   12    2    6    3    3    4    4    5    3 
##    2 2.01 2.02 2.03 2.04 2.05 2.06 2.07 2.08 2.09  2.1 2.11 2.12 2.13 2.14 
##  265  440  177  122   86   67   60   50   41   45   52   43   25   21   48 
## 2.15 2.16 2.17 2.18 2.19  2.2 2.21 2.22 2.23 2.24 2.25 2.26 2.27 2.28 2.29 
##   22   25   18   31   22   32   23   27   13   16   18   15   12   20   17 
##  2.3 2.31 2.32 2.33 2.34 2.35 2.36 2.37 2.38 2.39  2.4 2.41 2.42 2.43 2.44 
##   21   13   16    9    5    7    8    6    8    7   13    5    8    6    4 
## 2.45 2.46 2.47 2.48 2.49  2.5 2.51 2.52 2.53 2.54 2.55 2.56 2.57 2.58 2.59 
##    4    3    3    9    3   17   17    9    8    9    3    3    3    3    1 
##  2.6 2.61 2.63 2.64 2.65 2.66 2.67 2.68  2.7 2.71 2.72 2.74 2.75 2.77  2.8 
##    3    3    3    1    1    3    1    2    1    1    3    3    2    1    2 
##    3 3.01 3.02 3.04 3.05 3.11 3.22 3.24  3.4  3.5 3.51 3.65 3.67    4 4.01 
##    8   14    1    2    1    1    1    1    1    1    1    1    1    1    2 
## 4.13  4.5 5.01 
##    1    1    1
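Since carat is continuous, a table of every distinct value is hard to scan. A hedged alternative is to bin the weights first with `cut()`. The half-carat bin width and the toy `carat` vector below are our own choices for illustration.

```r
# Toy carat weights (illustrative, not rows from the dataset).
carat <- c(0.23, 0.31, 0.52, 0.70, 0.90, 1.01, 1.20, 1.51, 2.01, 3.00)

# Half-carat bins; right = FALSE closes each interval on the left, e.g. [0.5,1).
breaks <- seq(0, 5.5, by = 0.5)
carat_bin <- cut(carat, breaks = breaks, right = FALSE)

# Drop empty bins to keep the printout short.
binned <- table(droplevels(carat_bin))
print(binned)
```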

Now drawing two-way contingency tables for the categorical variables in the dataset

The table between cut and color

cut_color <- xtabs(~ cut + color, data = diamonds.csv)
ftable(cut_color)

##     color    1    2    3    4    5    6    7
## cut                                         
## 1          163  224  312  314  303  175  119
## 2          662  933  909  871  702  522  307
## 3         2834 3903 3826 4884 3115 2093  896
## 4         1603 2337 2331 2924 2360 1428  808
## 5         1513 2400 2164 2299 1824 1204  678

The table between cut and clarity

cut_clarity <- xtabs(~ cut + clarity, data = diamonds.csv)
ftable(cut_clarity)

##     clarity    1    2    3    4    5    6    7    8
## cut                                                
## 1            210    9  408  466  170  261   17   69
## 2             96   71 1560 1081  648  978  186  286
## 3            146 1212 4282 2598 3589 5071 2047 2606
## 4            205  230 3575 2949 1989 3357  616  870
## 5             84  268 3240 2100 1775 2591  789 1235

The table between color and clarity

color_clarity <- xtabs(~ color + clarity, data = diamonds.csv)
ftable(color_clarity)

##       clarity    1    2    3    4    5    6    7    8
## color                                                
## 1               42   73 2083 1370  705 1697  252  553
## 2              102  158 2426 1713 1281 2470  656  991
## 3              143  385 2131 1609 1364 2201  734  975
## 4              150  681 1976 1548 2148 2347  999 1443
## 5              162  299 2275 1563 1169 1643  585  608
## 6               92  143 1424  912  962 1169  355  365
## 7               50   51  750  479  542  731   74  131
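Raw counts in a two-way table are dominated by the largest groups, so row proportions can be easier to compare. A minimal sketch with `prop.table()` on a toy data frame (the `toy` values are made up):

```r
# Toy codes for two categorical variables.
toy <- data.frame(
  cut   = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  color = c(1, 2, 1, 1, 2, 1, 2, 2, 2, 2)
)

cut_color <- xtabs(~ cut + color, data = toy)

# margin = 1 divides each row by its total, so every row sums to 1.
row_share <- prop.table(cut_color, margin = 1)
print(round(row_share, 2))
```

Using `margin = 2` instead would give the share of each cut within a color.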

Now drawing box plots of each variable

For carat

boxplot(diamonds.csv$carat)

For cut

boxplot(diamonds.csv$cut)

For color

boxplot(diamonds.csv$color)

For clarity

boxplot(diamonds.csv$clarity)

For table

boxplot(diamonds.csv$table)

For depth

boxplot(diamonds.csv$depth)

For price

boxplot(diamonds.csv$price)
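A box plot shows outliers as individual points, but it does not say how many there are. Base R’s `boxplot.stats()` returns those points, so a quick count per column is possible. A sketch on toy columns (the values are made up):

```r
# Toy columns, each with one obviously extreme value.
toy <- data.frame(
  carat = c(0.3, 0.4, 0.5, 0.7, 0.7, 0.9, 1.0, 1.1, 1.2, 4.5),
  table = c(55, 56, 57, 57, 58, 58, 59, 60, 61, 95)
)

# $out holds the points beyond the whiskers (1.5 times the box length by default).
outlier_counts <- sapply(toy, function(col) length(boxplot.stats(col)$out))
print(outlier_counts)
```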

Now drawing histograms for suitable data fields

For distribution of diamond’s weights in carats

library(lattice)
histogram(~carat, data = diamonds.csv,
main = "Distribution of diamond's weight in Carat", xlab="Carats", col='red' )

For distribution of cuts of diamonds

histogram(~cut, data = diamonds.csv,
main = "Distribution of Cuts of diamonds ", xlab="Cuts", col='blue' )

For distribution of colors of diamonds

histogram(~color, data = diamonds.csv,
main = "Distribution of Colors of diamonds ", xlab="Colors", col='blue' )

For distribution of clarity of diamonds

histogram(~clarity, data = diamonds.csv,
main = "Distribution of Clarity of diamonds ", xlab="Clarity", col='green' )

For distribution of depths of diamonds

histogram(~depth, data = diamonds.csv,
main = "Distribution of depths of diamonds ", xlab="Depths", col='skyblue' )

For distribution of tables of diamonds

histogram(~table, data = diamonds.csv,
main = "Distribution of Tables of diamonds ", xlab="Tables", col='blue' )

For distribution of Prices of diamonds

histogram(~price, data = diamonds.csv,
main = "Distribution of Prices of diamonds ", xlab="Prices", col='red' )
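For a strongly right-skewed variable such as price, a histogram on a log scale often shows the shape more clearly. A hedged sketch using base `hist()` rather than lattice, on made-up prices; `plot = FALSE` only computes the bins, so drop it to draw the figure.

```r
# Toy right-skewed prices (illustrative values, not taken from the dataset).
prices <- c(400, 600, 900, 1500, 2400, 4000, 7000, 18000)

# Bin the log10 prices; breaks = 5 is only a suggestion to hist().
h <- hist(log10(prices), breaks = 5, plot = FALSE)

print(h$breaks)   # bin edges on the log10 scale
print(h$counts)   # number of prices per bin
```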

Now some plots to understand the data better

For distribution of prices of diamonds with cut

boxplot(price~cut,data = diamonds.csv,main="Distribution of prices of diamonds with cut",xlab="Cut",ylab="Prices")

For distribution of prices of diamonds with weight of diamonds

plot(price~carat,data = diamonds.csv,main="Distribution of prices of diamonds with weight",xlab="Carat",ylab="Prices")

For distribution of prices of diamonds with color of diamonds

boxplot(price~color,data = diamonds.csv,main="Distribution of prices of diamonds with color",xlab="Color",ylab="Prices")

For distribution of prices of diamonds with clarity of diamonds

boxplot(price~clarity,data = diamonds.csv,main="Distribution of prices of diamonds with clarity",xlab="Clarity",ylab="Prices")

For distribution of prices of diamonds with depth of diamonds

plot(price~depth,data = diamonds.csv,main="Distribution of prices of diamonds with depth",xlab="depth",ylab="Prices")

For distribution of prices of diamonds with table of diamonds

boxplot(price~table,data = diamonds.csv,main="Distribution of prices of diamonds with table",xlab="table",ylab="Prices")
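The grouped box plots can be backed up with numbers by computing the median price per group. A minimal sketch with `tapply()` on toy data (the `toy` values are made up):

```r
# Toy cut codes and prices.
toy <- data.frame(
  cut   = c(1, 1, 2, 2, 3, 3),
  price = c(500, 700, 900, 1100, 3000, 5000)
)

# Median price within each cut code.
median_by_cut <- tapply(toy$price, toy$cut, median)
print(median_by_cut)
```

The same call works for color or clarity by swapping the grouping column.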

For a correlation matrix

cor(diamonds.csv)

##                   X       carat           cut         color     clarity
## X        1.00000000 -0.37798348 -0.0233272316 -0.0950979466  0.12513599
## carat   -0.37798348  1.00000000  0.0171237362  0.2914367543 -0.21429037
## cut     -0.02332723  0.01712374  1.0000000000  0.0003042479  0.02823537
## color   -0.09509795  0.29143675  0.0003042479  1.0000000000 -0.02779550
## clarity  0.12513599 -0.21429037  0.0282353656 -0.0277954960  1.00000000
## depth   -0.03480023  0.02822431 -0.1942485626  0.0472792348 -0.05308011
## table   -0.10083032  0.18161755  0.1503270263  0.0264652011 -0.08822266
## price   -0.30687318  0.92159130  0.0398602909  0.1725109282 -0.07153497
## x       -0.40544047  0.97509423  0.0223419276  0.2702866854 -0.22572144
## y       -0.39584267  0.95172220  0.0275720250  0.2635844027 -0.21761579
## z       -0.39920829  0.95338738  0.0020373568  0.2682268757 -0.22426307
##               depth       table       price           x           y
## X       -0.03480023 -0.10083032 -0.30687318 -0.40544047 -0.39584267
## carat    0.02822431  0.18161755  0.92159130  0.97509423  0.95172220
## cut     -0.19424856  0.15032703  0.03986029  0.02234193  0.02757203
## color    0.04727923  0.02646520  0.17251093  0.27028669  0.26358440
## clarity -0.05308011 -0.08822266 -0.07153497 -0.22572144 -0.21761579
## depth    1.00000000 -0.29577852 -0.01064740 -0.02528925 -0.02934067
## table   -0.29577852  1.00000000  0.12713390  0.19534428  0.18376015
## price   -0.01064740  0.12713390  1.00000000  0.88443516  0.86542090
## x       -0.02528925  0.19534428  0.88443516  1.00000000  0.97470148
## y       -0.02934067  0.18376015  0.86542090  0.97470148  1.00000000
## z        0.09492388  0.15092869  0.86124944  0.97077180  0.95200572
##                    z
## X       -0.399208287
## carat    0.953387381
## cut      0.002037357
## color    0.268226876
## clarity -0.224263069
## depth    0.094923882
## table    0.150928692
## price    0.861249444
## x        0.970771799
## y        0.952005716
## z        1.000000000
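Since price is the variable of interest, it helps to pull out just the price column of the matrix and sort it. A sketch on a small deterministic toy data frame (all values are made up):

```r
# Toy data: price grows with carat, depth is unrelated.
carat <- seq(0.2, 2, length.out = 10)
toy <- data.frame(
  carat = carat,
  depth = c(61, 62, 60, 63, 61, 62, 60, 63, 61, 62),
  price = 4000 * carat^2
)

cors <- cor(toy)

# Keep the price column, drop price's correlation with itself, sort descending.
price_cor <- sort(cors[, "price"], decreasing = TRUE)
price_cor <- price_cor[names(price_cor) != "price"]
print(round(price_cor, 2))
```

For the real matrix above, the same step would put carat (0.92) at the top, followed by x, y and z.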

To visualise this, let’s create a corrgram

library(corrplot)

## corrplot 0.84 loaded

corrplot.mixed(corr = cor(diamonds.csv, use = "complete.obs"),
               lower = "shade", upper = "pie", tl.pos = "d")

Now for a scatterplot matrix of the variables

library(car)

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

# One-sided formula listing the variables to plot against each other
model <- ~ price + cut + carat + clarity + color + depth + table
scatterplotMatrix(model,
                  data = diamonds.csv,
                  main = "Scatter Plot")
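With 53940 rows, a scatterplot matrix can be slow to draw and heavily overplotted. One option is to plot a random subset instead. A sketch on a toy data frame of the same length follows; the subset size of 2000 and the seed are arbitrary choices of ours.

```r
set.seed(42)  # make the random subset reproducible

# Toy stand-in with the same number of rows as the dataset.
toy <- data.frame(carat = runif(53940, 0.2, 5.01),
                  price = runif(53940, 326, 18823))

# Draw 2000 row indices without replacement and keep those rows.
subset_rows <- toy[sample(nrow(toy), size = 2000), ]
print(dim(subset_rows))

# The subset can then be passed as the data argument, e.g.
# scatterplotMatrix(~ price + carat, data = subset_rows)
```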