This classic dataset contains the prices and other attributes of almost 54,000 diamonds. The contents of the dataset are as follows:
1- Price: price in US dollars ($326–$18,823)
2- Carat: weight of the diamond (0.2–5.01)
3- Cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
4- Color: diamond colour, from J (worst) to D (best)
5- Clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
6- x: length in mm (0–10.74)
7- y: width in mm (0–58.9)
8- z: depth in mm (0–31.8)
9- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
10- table: width of top of diamond relative to widest point (43–95)
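As a small worked example of the depth formula in item 9, the sketch below computes the total depth percentage for one diamond. The measurements are hypothetical values chosen for illustration, and the factor of 100 is an assumption made to match the 43–79 range quoted above. Note that in R the mean of two numbers has to be written `mean(c(x, y))`, because `mean(x, y)` would treat `y` as the `trim` argument.

```r
# Hypothetical measurements in mm (illustrative, not taken from the dataset)
x <- 3.95  # length
y <- 3.98  # width
z <- 2.43  # depth

# Total depth percentage: z divided by the mean of x and y, times 100
depth_pct <- 100 * z / mean(c(x, y))

# Equivalent closed form from the variable list
depth_pct_alt <- 100 * 2 * z / (x + y)

print(round(depth_pct, 2))
print(round(depth_pct_alt, 2))
```

Both expressions give the same value, which is what the identity in the variable list says.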
We’ll study the effect of these factors on the price of a diamond. After this analysis we’ll know which factors play a major role in deciding a diamond’s price. Let’s get started.
setwd("C:/Users/SAURAB~1/AppData/Local/Temp/Rar$DIa0.525")
diamonds.csv <- read.csv("diamonds.csv", stringsAsFactors = TRUE)
View(diamonds.csv)
dim(diamonds.csv)
## [1] 53940 11
So there are 53940 rows and 11 columns: the ten variables listed above plus a row-index column, X.
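First we convert the categorical columns of this dataset (cut, color and clarity) into numeric codes so that the analysis becomes easier. The codes follow the alphabetical order of the labels, not the quality order: for cut, 1 = Fair, 2 = Good, 3 = Ideal, 4 = Premium, 5 = Very Good; for color, 1 = D through 7 = J; for clarity, 1 = I1, 2 = IF, 3 = SI1, 4 = SI2, 5 = VS1, 6 = VS2, 7 = VVS1, 8 = VVS2.
diamonds.csv[, 3:5] <- lapply(diamonds.csv[, 3:5], function(col) as.numeric(as.factor(col)))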
View(diamonds.csv)
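Converting a factor with `as.numeric` returns its level codes, and by default R sorts the levels alphabetically. If quality-ordered codes are wanted instead, one option is to state the levels explicitly. The sketch below shows the difference on a toy vector of cut labels; the vector is made up for illustration, and the level order is the one given in the variable list, assumed here to run from worst to best.

```r
# Toy vector of cut labels (illustrative, not the real column)
cut_labels <- c("Ideal", "Premium", "Good", "Very Good", "Fair")

# Default factor levels are sorted alphabetically
alphabetical_codes <- as.numeric(factor(cut_labels))

# Explicit levels in the order of the variable list (assumed worst to best)
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
quality_codes <- as.numeric(factor(cut_labels, levels = cut_levels, ordered = TRUE))

print(data.frame(cut_labels, alphabetical_codes, quality_codes))
```

With the alphabetical coding, Ideal gets code 3 and Very Good gets code 5; with explicit levels, Ideal gets 5 and Very Good gets 3.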
library(psych)
describe(diamonds.csv)[ ,1:9]
## vars n mean sd median trimmed mad min
## X 1 53940 26970.50 15571.28 26970.50 26970.50 19992.86 1.0
## carat 2 53940 0.80 0.47 0.70 0.73 0.47 0.2
## cut 3 53940 3.55 1.03 3.00 3.60 1.48 1.0
## color 4 53940 3.59 1.70 4.00 3.55 1.48 1.0
## clarity 5 53940 4.84 1.72 5.00 4.75 1.48 1.0
## depth 6 53940 61.75 1.43 61.80 61.78 1.04 43.0
## table 7 53940 57.46 2.23 57.00 57.32 1.48 43.0
## price 8 53940 3932.80 3989.44 2401.00 3158.99 2475.94 326.0
## x 9 53940 5.73 1.12 5.70 5.66 1.38 0.0
## y 10 53940 5.73 1.14 5.71 5.66 1.36 0.0
## z 11 53940 3.54 0.71 3.53 3.49 0.85 0.0
## max
## X 53940.00
## carat 5.01
## cut 5.00
## color 7.00
## clarity 8.00
## depth 79.00
## table 95.00
## price 18823.00
## x 10.74
## y 58.90
## z 31.80
These are the summary statistics of all the variables.
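The minimum of x, y and z in the summary is 0, which cannot be a real measurement for a physical stone. A possible way to flag such rows is sketched below on a toy data frame; the values are made up, and whether to drop or keep these rows in the real data is a judgment call left open here.

```r
# Toy data frame standing in for the x, y, z columns (values are made up)
dims <- data.frame(
  x = c(3.95, 0.00, 4.05, 6.20),
  y = c(3.98, 0.00, 4.07, 6.15),
  z = c(2.43, 0.00, 0.00, 3.80)
)

# A row is suspect if any of its three dimensions is recorded as zero
has_zero_dim <- rowSums(dims[, c("x", "y", "z")] == 0) > 0

# Number of suspect rows, and the data with those rows dropped
n_suspect <- sum(has_zero_dim)
dims_clean <- dims[!has_zero_dim, ]

print(n_suspect)
print(dims_clean)
```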
For summary statistics of prices of diamonds
describe(diamonds.csv$price)
## vars n mean sd median trimmed mad min max range skew
## X1 1 53940 3932.8 3989.44 2401 3158.99 2475.94 326 18823 18497 1.62
## kurtosis se
## X1 2.18 17.18
So the mean price is 3932.8 and the median price is 2401; the mean sits well above the median, which is consistent with the positive skew of 1.62. Prices range from 326 to 18823 with a standard deviation of 3989.44.
For summary statistics of carats of diamonds
describe(diamonds.csv$carat)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 53940 0.8 0.47 0.7 0.73 0.47 0.2 5.01 4.81 1.12 1.26
## se
## X1 0
So the mean weight is 0.8 carats and the median weight is 0.7 carats. Weights range from 0.2 to 5.01 carats with a standard deviation of 0.47.
For summary statistics of depths of diamonds
describe(diamonds.csv$depth)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 53940 61.75 1.43 61.8 61.78 1.04 43 79 36 -0.08 5.74
## se
## X1 0.01
For cuts of the diamonds
table(diamonds.csv$cut)
##
## 1 2 3 4 5
## 1610 4906 21551 13791 12082
For color of the diamonds
table(diamonds.csv$color)
##
## 1 2 3 4 5 6 7
## 6775 9797 9542 11292 8304 5422 2808
For clarity of the diamonds
table(diamonds.csv$clarity)
##
## 1 2 3 4 5 6 7 8
## 741 1790 13065 9194 8171 12258 3655 5066
For carats of the diamonds
table(diamonds.csv$carat)
##
## 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34
## 12 9 5 293 254 212 253 233 198 130 2604 2249 1840 1189 910
## 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49
## 667 572 394 670 398 1299 1382 706 488 212 110 178 99 63 45
## 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64
## 1258 1127 817 709 625 496 492 430 310 282 228 204 135 102 80
## 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79
## 65 48 48 25 26 1981 1294 764 492 322 249 251 251 187 155
## 0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94
## 284 200 140 131 64 62 34 31 23 21 1485 570 226 142 59
## 0.95 0.96 0.97 0.98 0.99 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09
## 65 103 59 31 23 1558 2242 883 523 475 361 373 342 246 287
## 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24
## 278 308 251 246 207 149 172 110 123 126 645 473 300 279 236
## 1.25 1.26 1.27 1.28 1.29 1.3 1.31 1.32 1.33 1.34 1.35 1.36 1.37 1.38 1.39
## 187 146 134 106 101 122 133 89 87 68 77 50 46 26 36
## 1.4 1.41 1.42 1.43 1.44 1.45 1.46 1.47 1.48 1.49 1.5 1.51 1.52 1.53 1.54
## 50 40 25 19 18 15 18 21 7 11 793 807 381 220 174
## 1.55 1.56 1.57 1.58 1.59 1.6 1.61 1.62 1.63 1.64 1.65 1.66 1.67 1.68 1.69
## 124 109 106 89 89 95 64 61 50 43 32 30 25 19 24
## 1.7 1.71 1.72 1.73 1.74 1.75 1.76 1.77 1.78 1.79 1.8 1.81 1.82 1.83 1.84
## 215 119 57 52 40 50 28 17 12 15 21 9 13 18 4
## 1.85 1.86 1.87 1.88 1.89 1.9 1.91 1.92 1.93 1.94 1.95 1.96 1.97 1.98 1.99
## 3 9 7 4 4 7 12 2 6 3 3 4 4 5 3
## 2 2.01 2.02 2.03 2.04 2.05 2.06 2.07 2.08 2.09 2.1 2.11 2.12 2.13 2.14
## 265 440 177 122 86 67 60 50 41 45 52 43 25 21 48
## 2.15 2.16 2.17 2.18 2.19 2.2 2.21 2.22 2.23 2.24 2.25 2.26 2.27 2.28 2.29
## 22 25 18 31 22 32 23 27 13 16 18 15 12 20 17
## 2.3 2.31 2.32 2.33 2.34 2.35 2.36 2.37 2.38 2.39 2.4 2.41 2.42 2.43 2.44
## 21 13 16 9 5 7 8 6 8 7 13 5 8 6 4
## 2.45 2.46 2.47 2.48 2.49 2.5 2.51 2.52 2.53 2.54 2.55 2.56 2.57 2.58 2.59
## 4 3 3 9 3 17 17 9 8 9 3 3 3 3 1
## 2.6 2.61 2.63 2.64 2.65 2.66 2.67 2.68 2.7 2.71 2.72 2.74 2.75 2.77 2.8
## 3 3 3 1 1 3 1 2 1 1 3 3 2 1 2
## 3 3.01 3.02 3.04 3.05 3.11 3.22 3.24 3.4 3.5 3.51 3.65 3.67 4 4.01
## 8 14 1 2 1 1 1 1 1 1 1 1 1 1 2
## 4.13 4.5 5.01
## 1 1 1
The table between cut and color
cut_color <-xtabs(~cut+color,data = diamonds.csv)
ftable(cut_color)
## color 1 2 3 4 5 6 7
## cut
## 1 163 224 312 314 303 175 119
## 2 662 933 909 871 702 522 307
## 3 2834 3903 3826 4884 3115 2093 896
## 4 1603 2337 2331 2924 2360 1428 808
## 5 1513 2400 2164 2299 1824 1204 678
The table between cut and clarity
cut_clarity <-xtabs(~cut+clarity,data = diamonds.csv)
ftable(cut_clarity)
## clarity 1 2 3 4 5 6 7 8
## cut
## 1 210 9 408 466 170 261 17 69
## 2 96 71 1560 1081 648 978 186 286
## 3 146 1212 4282 2598 3589 5071 2047 2606
## 4 205 230 3575 2949 1989 3357 616 870
## 5 84 268 3240 2100 1775 2591 789 1235
The table between color and clarity
color_clarity <-xtabs(~color+clarity,data = diamonds.csv)
ftable(color_clarity)
## clarity 1 2 3 4 5 6 7 8
## color
## 1 42 73 2083 1370 705 1697 252 553
## 2 102 158 2426 1713 1281 2470 656 991
## 3 143 385 2131 1609 1364 2201 734 975
## 4 150 681 1976 1548 2148 2347 999 1443
## 5 162 299 2275 1563 1169 1643 585 608
## 6 92 143 1424 912 962 1169 355 365
## 7 50 51 750 479 542 731 74 131
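Raw counts in a cross-tabulation are hard to compare when the row totals differ a lot, as they do here. One common option is to convert the counts to row shares with `prop.table`. The sketch below uses a small made-up data frame rather than the real data.

```r
# Toy data with coded cut and color values (made up for illustration)
toy <- data.frame(
  cut   = c(1, 1, 2, 2, 2, 3),
  color = c(1, 2, 1, 1, 2, 2)
)

counts <- xtabs(~ cut + color, data = toy)

# margin = 1 divides each row by its row total, so every row sums to 1
row_shares <- prop.table(counts, margin = 1)

print(round(row_shares, 2))
```

Using `margin = 2` instead would give column shares.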
For carat
boxplot(diamonds.csv$carat)
For cut
boxplot(diamonds.csv$cut)
For color
boxplot(diamonds.csv$color)
For clarity
boxplot(diamonds.csv$clarity)
For table
boxplot(diamonds.csv$table)
For depth
boxplot(diamonds.csv$depth)
For price
boxplot(diamonds.csv$price)
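The points a boxplot draws beyond its whiskers can also be pulled out as numbers with `boxplot.stats`, which helps when the plot is too crowded to read. A minimal sketch on made-up values:

```r
# Toy values with one obvious outlier (made up for illustration)
values <- c(1, 2, 3, 4, 5, 100)

# boxplot.stats() returns the five whisker/hinge/median values in $stats
# and the points lying beyond the whiskers in $out
stats <- boxplot.stats(values)
outliers <- stats$out

print(stats$stats)
print(outliers)
```

The same call could be applied to any numeric column, for example the price column, to count how many values fall beyond the whiskers.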
For distribution of diamond’s weights in carats
library(lattice)
histogram(~carat, data = diamonds.csv,
main = "Distribution of diamond's weight in Carat", xlab="Carats", col='red' )
For distribution of cuts of diamonds
histogram(~cut, data = diamonds.csv,
main = "Distribution of Cuts of diamonds ", xlab="Cuts", col='blue' )
For distribution of colors of diamonds
histogram(~color, data = diamonds.csv,
main = "Distribution of Colors of diamonds ", xlab="Colors", col='blue' )
For distribution of clarity of diamonds
histogram(~clarity, data = diamonds.csv,
main = "Distribution of Clarity of diamonds ", xlab="Clarity", col='green' )
For distribution of depths of diamonds
histogram(~depth, data = diamonds.csv,
main = "Distribution of depths of diamonds ", xlab="Depths", col='skyblue' )
For distribution of tables of diamonds
histogram(~table, data = diamonds.csv,
main = "Distribution of Tables of diamonds ", xlab="Tables", col='blue' )
For distribution of Prices of diamonds
histogram(~price, data = diamonds.csv,
main = "Distribution of Prices of diamonds ", xlab="Prices", col='red' )
For distribution of prices of diamonds with cut
boxplot(price~cut,data = diamonds.csv,main="Distribution of prices of diamonds with cut",xlab="Cut",ylab="Prices")
For distribution of prices of diamonds with weight of diamonds
plot(price~carat,data = diamonds.csv,main="Distribution of prices of diamonds with weight",xlab="Carat",ylab="Prices")
For distribution of prices of diamonds with color of diamonds
boxplot(price~color,data = diamonds.csv,main="Distribution of prices of diamonds with color",xlab="Color",ylab="Prices")
For distribution of prices of diamonds with clarity of diamonds
boxplot(price~clarity,data = diamonds.csv,main="Distribution of prices of diamonds with clarity",xlab="Clarity",ylab="Prices")
For distribution of prices of diamonds with depth of diamonds
plot(price~depth,data = diamonds.csv,main="Distribution of prices of diamonds with depth",xlab="depth",ylab="Prices")
For distribution of prices of diamonds with table of diamonds
plot(price~table,data = diamonds.csv,main="Distribution of prices of diamonds with table",xlab="table",ylab="Prices")
For correlations between all the variables
cor(diamonds.csv)
## X carat cut color clarity
## X 1.00000000 -0.37798348 -0.0233272316 -0.0950979466 0.12513599
## carat -0.37798348 1.00000000 0.0171237362 0.2914367543 -0.21429037
## cut -0.02332723 0.01712374 1.0000000000 0.0003042479 0.02823537
## color -0.09509795 0.29143675 0.0003042479 1.0000000000 -0.02779550
## clarity 0.12513599 -0.21429037 0.0282353656 -0.0277954960 1.00000000
## depth -0.03480023 0.02822431 -0.1942485626 0.0472792348 -0.05308011
## table -0.10083032 0.18161755 0.1503270263 0.0264652011 -0.08822266
## price -0.30687318 0.92159130 0.0398602909 0.1725109282 -0.07153497
## x -0.40544047 0.97509423 0.0223419276 0.2702866854 -0.22572144
## y -0.39584267 0.95172220 0.0275720250 0.2635844027 -0.21761579
## z -0.39920829 0.95338738 0.0020373568 0.2682268757 -0.22426307
## depth table price x y
## X -0.03480023 -0.10083032 -0.30687318 -0.40544047 -0.39584267
## carat 0.02822431 0.18161755 0.92159130 0.97509423 0.95172220
## cut -0.19424856 0.15032703 0.03986029 0.02234193 0.02757203
## color 0.04727923 0.02646520 0.17251093 0.27028669 0.26358440
## clarity -0.05308011 -0.08822266 -0.07153497 -0.22572144 -0.21761579
## depth 1.00000000 -0.29577852 -0.01064740 -0.02528925 -0.02934067
## table -0.29577852 1.00000000 0.12713390 0.19534428 0.18376015
## price -0.01064740 0.12713390 1.00000000 0.88443516 0.86542090
## x -0.02528925 0.19534428 0.88443516 1.00000000 0.97470148
## y -0.02934067 0.18376015 0.86542090 0.97470148 1.00000000
## z 0.09492388 0.15092869 0.86124944 0.97077180 0.95200572
## z
## X -0.399208287
## carat 0.953387381
## cut 0.002037357
## color 0.268226876
## clarity -0.224263069
## depth 0.094923882
## table 0.150928692
## price 0.861249444
## x 0.970771799
## y 0.952005716
## z 1.000000000
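A full correlation matrix is hard to scan when the question is only which variables move with price. One way to narrow it down is to take the price column and rank the other variables by absolute correlation. The sketch below does this on a small made-up data frame; the column names mirror the dataset but the values are hypothetical.

```r
# Toy data frame (made-up values) standing in for the numeric diamonds columns
toy <- data.frame(
  price = c(300, 500, 900, 1500, 2600),
  carat = c(0.2, 0.3, 0.5, 0.7, 1.0),
  depth = c(61, 63, 60, 62, 61)
)

corr <- cor(toy)

# Take the price column, drop price itself, and order by absolute correlation
price_corr <- corr[, "price"]
price_corr <- price_corr[names(price_corr) != "price"]
ranked <- price_corr[order(abs(price_corr), decreasing = TRUE)]

print(round(ranked, 2))
```

Ranking by absolute value keeps strong negative correlations near the top as well.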
library(corrplot)
## corrplot 0.84 loaded
corrplot.mixed(corr=cor(diamonds.csv, use="complete.obs"),lower = "shade" ,
upper="pie", tl.pos="d")
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
model <- ~price+cut+carat+clarity+color+depth+table
scatterplotMatrix(model,
data=diamonds.csv,
main = "Scatter Plot")