Read Data from local source
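# stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0
diamonds <- read.table(file = "C:/Users/Layla Habibullah/Desktop/diamonds.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)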
Diamonds are popular gemstones, used above all in jewelry. Their price ranges from modest to very high depending on their characteristics. This final project uses a diamond data set, downloaded from an online data source, to examine how price relates to carat, color, cut, clarity and depth, and to identify which of these factors drive the price differences observed in the data.
The aims of the project are:
1. To compute basic statistics for the diamond data set (for instance mean, median, quartiles, and the number of rows and columns) and draw conclusions from them.
2. To wrangle the data set into a more usable form (data aggregation, renaming columns, and changing column classes as required).
3. To explore the data graphically to identify the relationship between price and carat with respect to color, cut and clarity.
4. To conclude the findings.
head(diamonds)
## X carat cut color clarity depth table price x y z
## 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
summary(diamonds)
## X carat cut color
## Min. : 1 Min. :0.2000 Fair : 1610 D: 6775
## 1st Qu.:13486 1st Qu.:0.4000 Good : 4906 E: 9797
## Median :26971 Median :0.7000 Ideal :21551 F: 9542
## Mean :26971 Mean :0.7979 Premium :13791 G:11292
## 3rd Qu.:40455 3rd Qu.:1.0400 Very Good:12082 H: 8304
## Max. :53940 Max. :5.0100 I: 5422
## J: 2808
## clarity depth table price
## SI1 :13065 Min. :43.00 Min. :43.00 Min. : 326
## VS2 :12258 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950
## SI2 : 9194 Median :61.80 Median :57.00 Median : 2401
## VS1 : 8171 Mean :61.75 Mean :57.46 Mean : 3933
## VVS2 : 5066 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324
## VVS1 : 3655 Max. :79.00 Max. :95.00 Max. :18823
## (Other): 2531
## x y z
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.700 Median : 5.710 Median : 3.530
## Mean : 5.731 Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :10.740 Max. :58.900 Max. :31.800
##
str(diamonds)
## 'data.frame': 53940 obs. of 11 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Factor w/ 5 levels "Fair","Good",..: 3 4 2 4 2 5 5 5 1 5 ...
## $ color : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Factor w/ 8 levels "I1","IF","SI1",..: 4 3 5 6 4 8 7 3 6 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
The dataset contains 11 columns: X, which appears to be a serial number, followed by carat, cut, color, clarity, depth, table, price, x, y and z. It contains 53,940 observations. The mean price is $3,933 and the median is $2,401, with a minimum of $326 and a maximum of $18,823. The mean carat is 0.7979 and the median is 0.70, with values ranging from 0.20 to 5.01.
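The figures quoted above can also be pulled out programmatically rather than read off the summary() printout. The following is a minimal sketch, assuming a small data frame `d` built from the six rows shown by head(); the names `d`, `dims`, `num_cols` and `stats` are illustrative and not part of the original analysis.

```r
# First six rows of the data set, copied from the head() output
# (only three columns are kept to keep the sketch short)
d <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  cut   = c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good"),
  price = c(326L, 326L, 327L, 334L, 335L, 336L)
)

# Number of rows (observations) and columns (variables)
dims <- dim(d)

# Keep only the numeric columns and compute the same statistics for each
num_cols <- d[sapply(d, is.numeric)]
stats <- sapply(num_cols, function(v) {
  c(min = min(v), median = median(v), mean = mean(v), max = max(v))
})

print(dims)
print(round(stats, 4))
```

Applied to the full data frame, the same calls should reproduce the minimum, median, mean and maximum reported by summary().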
aggregate(diamonds$price, by=list(diamonds$cut), FUN=mean, na.rm=TRUE)
## Group.1 x
## 1 Fair 4358.758
## 2 Good 3928.864
## 3 Ideal 3457.542
## 4 Premium 4584.258
## 5 Very Good 3981.760
aggregate(diamonds$price, by=list(diamonds$color), FUN=mean, na.rm=TRUE)
## Group.1 x
## 1 D 3169.954
## 2 E 3076.752
## 3 F 3724.886
## 4 G 3999.136
## 5 H 4486.669
## 6 I 5091.875
## 7 J 5323.818
aggregate(diamonds$price, by=list(diamonds$depth), FUN=mean, na.rm=TRUE)
## Group.1 x
## 1 43.0 4206.000
## 2 44.0 4032.000
## 3 50.8 6727.000
## 4 51.0 945.000
## 5 52.2 1895.000
## 6 52.3 1166.000
## 7 52.7 1293.000
## 8 53.0 2856.000
## 9 53.1 2815.000
## 10 53.2 2988.500
## 11 53.3 2855.000
## 12 53.4 2164.000
## 13 53.8 4790.000
## 14 54.0 1012.000
## 15 54.2 905.500
## 16 54.3 1352.000
## 17 54.4 1013.000
## 18 54.6 1011.000
## 19 54.7 2691.000
## 20 55.0 2319.500
## 21 55.1 2393.333
## 22 55.2 3479.833
## 23 55.3 3336.000
## 24 55.4 2867.500
## 25 55.5 1946.000
## 26 55.6 4800.000
## 27 55.8 3197.571
## 28 55.9 3995.778
## 29 56.0 3432.500
## 30 56.1 4501.333
## 31 56.2 5561.273
## 32 56.3 3049.529
## 33 56.4 2681.000
## 34 56.5 3858.231
## 35 56.6 2872.400
## 36 56.7 4919.000
## 37 56.8 4886.875
## 38 56.9 4645.346
## 39 57.0 4191.632
## 40 57.1 3394.542
## 41 57.2 3978.111
## 42 57.3 4230.600
## 43 57.4 3645.162
## 44 57.5 3481.739
## 45 57.6 5288.263
## 46 57.7 3879.082
## 47 57.8 4788.020
## 48 57.9 4986.679
## 49 58.0 4678.022
## 50 58.1 4500.777
## 51 58.2 5562.035
## 52 58.3 4261.887
## 53 58.4 4770.598
## 54 58.5 5218.732
## 55 58.6 5172.447
## 56 58.7 4909.576
## 57 58.8 4348.134
## 58 58.9 5036.162
## 59 59.0 4599.811
## 60 59.1 4367.402
## 61 59.2 4240.377
## 62 59.3 4772.441
## 63 59.4 4711.286
## 64 59.5 4284.459
## 65 59.6 4633.917
## 66 59.7 4075.138
## 67 59.8 4715.109
## 68 59.9 4489.675
## 69 60.0 3637.217
## 70 60.1 4548.991
## 71 60.2 4253.793
## 72 60.3 4317.510
## 73 60.4 3819.040
## 74 60.5 4255.520
## 75 60.6 4107.379
## 76 60.7 4058.586
## 77 60.8 3874.561
## 78 60.9 3566.256
## 79 61.0 3555.519
## 80 61.1 3843.393
## 81 61.2 3652.300
## 82 61.3 3564.140
## 83 61.4 3513.588
## 84 61.5 3751.722
## 85 61.6 3472.563
## 86 61.7 3527.523
## 87 61.8 3551.376
## 88 61.9 3498.736
## 89 62.0 3825.946
## 90 62.1 3571.067
## 91 62.2 3993.337
## 92 62.3 3894.251
## 93 62.4 4122.480
## 94 62.5 4096.372
## 95 62.6 4230.136
## 96 62.7 4303.221
## 97 62.8 4405.115
## 98 62.9 4130.585
## 99 63.0 4217.516
## 100 63.1 3872.365
## 101 63.2 3747.531
## 102 63.3 3857.152
## 103 63.4 3618.174
## 104 63.5 3849.479
## 105 63.6 4212.819
## 106 63.7 3830.281
## 107 63.8 4121.166
## 108 63.9 3939.952
## 109 64.0 3983.671
## 110 64.1 4458.853
## 111 64.2 4363.552
## 112 64.3 4080.646
## 113 64.4 4050.069
## 114 64.5 4341.745
## 115 64.6 4583.895
## 116 64.7 4377.384
## 117 64.8 5322.203
## 118 64.9 4670.864
## 119 65.0 5097.288
## 120 65.1 4265.119
## 121 65.2 4414.519
## 122 65.3 3870.293
## 123 65.4 5714.875
## 124 65.5 4164.333
## 125 65.6 4221.757
## 126 65.7 3555.692
## 127 65.8 5203.787
## 128 65.9 5419.250
## 129 66.0 4149.793
## 130 66.1 3900.379
## 131 66.2 3435.235
## 132 66.3 4740.040
## 133 66.4 3544.765
## 134 66.5 3757.087
## 135 66.6 4741.375
## 136 66.7 4427.786
## 137 66.8 4147.476
## 138 66.9 4484.435
## 139 67.0 4257.417
## 140 67.1 4449.800
## 141 67.2 5493.500
## 142 67.3 3038.308
## 143 67.4 5274.625
## 144 67.5 7276.200
## 145 67.6 4656.083
## 146 67.7 6257.333
## 147 67.8 3180.750
## 148 67.9 4343.750
## 149 68.0 4910.167
## 150 68.1 4024.667
## 151 68.2 3804.400
## 152 68.3 3563.167
## 153 68.4 2518.800
## 154 68.5 3800.400
## 155 68.6 2340.714
## 156 68.7 4816.250
## 157 68.8 4428.000
## 158 68.9 4048.000
## 159 69.0 4113.333
## 160 69.1 8736.000
## 161 69.2 1739.000
## 162 69.3 2523.000
## 163 69.4 5405.000
## 164 69.5 3864.000
## 165 69.6 5711.000
## 166 69.7 3590.750
## 167 69.8 3644.333
## 168 69.9 2117.000
## 169 70.0 5083.000
## 170 70.1 5446.000
## 171 70.2 7997.333
## 172 70.5 6860.000
## 173 70.6 8266.667
## 174 70.8 1020.500
## 175 71.0 613.000
## 176 71.2 1274.000
## 177 71.3 4368.000
## 178 71.6 2644.500
## 179 71.8 4455.000
## 180 72.2 2438.000
## 181 72.9 2691.000
## 182 73.6 1789.000
## 183 78.2 1262.000
## 184 79.0 2579.000
Prices were aggregated to obtain the mean price of diamonds for each cut, color and depth. The tables show that mean price varies with cut, but not in the expected direction: Fair-cut diamonds average 4358.76, while Very Good-cut diamonds average 3981.76. Mean price likewise differs across colors and depths. A likely explanation is that other factors, either absent from the data set or not included in these tables (such as clarity or table), have a significant effect on price.
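One way to probe the puzzle described above is to aggregate a second variable alongside price, and to group a continuous variable such as depth into intervals instead of one group per distinct value. This is only a sketch under stated assumptions: the data frame `d` holds the six rows shown by head(), carat is chosen as the second variable purely for illustration, and the depth break points (40, 60, 62, 64, 80) are arbitrary.

```r
# First six rows of the data set, copied from the head() output
d <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  cut   = c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good"),
  depth = c(61.5, 59.8, 56.9, 62.4, 63.3, 62.8),
  price = c(326L, 326L, 327L, 334L, 335L, 336L)
)

# Formula interface: mean price and mean carat side by side for each cut
by_cut <- aggregate(cbind(price, carat) ~ cut, data = d, FUN = mean)

# Bin depth into intervals with base R's cut() function
# (unrelated to the column named "cut"); break points are illustrative
d$depth_bin <- cut(d$depth, breaks = c(40, 60, 62, 64, 80))
by_depth <- aggregate(price ~ depth_bin, data = d, FUN = mean)

print(by_cut)
print(by_depth)
```

The formula interface also labels the result columns with the variable names instead of Group.1 and x, which makes the tables easier to read.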
The first column has no name in the source file, so R labels it X. For clarity (and as part of data wrangling), it is renamed ‘S.No’.
Furthermore, we will check the types of price and cut to ensure they are stored in the required classes.
colnames(diamonds)[1] <- 'S.No'
head(diamonds)
## S.No carat cut color clarity depth table price x y z
## 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
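The renaming and class checks mentioned above could look roughly like the sketch below. It assumes a small data frame `d` built from the head() rows, and it assumes the conventional quality ordering Fair < Good < Very Good < Premium < Ideal for cut; neither the ordering nor the conversion to numeric is stated in the original analysis.

```r
# Small stand-in for the data frame, copied from the head() output
d <- data.frame(
  X     = 1:6,
  cut   = c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good"),
  price = c(326L, 326L, 327L, 334L, 335L, 336L),
  stringsAsFactors = TRUE
)

# Rename by matching the old name rather than relying on column position
names(d)[names(d) == "X"] <- "S.No"

# Inspect the class of every column before changing anything
col_classes <- sapply(d, class)
print(col_classes)

# Assumed quality ordering, worst to best (not taken from the report)
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
d$cut <- factor(d$cut, levels = cut_levels, ordered = TRUE)

# Store price as double so later arithmetic is not restricted to integers
d$price <- as.numeric(d$price)

str(d)
```

With an ordered factor, tables and plot legends list the cuts by quality instead of alphabetically.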
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked _by_ '.GlobalEnv':
##
## diamonds
Next, we examine the distributions of price and carat using histograms.
hist(diamonds$price, main="Price Histogram", xlab='Price')
hist(diamonds$carat, main="Carat Histogram", xlab= 'Carat')
Histograms of price and carat were drawn to evaluate the shape of the data. The price histogram is clearly right-skewed: most prices fall on the left-hand side, below about 4000, with a long tail of more expensive diamonds. Price is therefore not normally distributed, which may blur the analysis of price against its determinants, but we evaluate the data as given. Similarly, the carat histogram shows that most diamonds are below 2 carats, so the data cannot clearly establish the relationship for diamonds above 3 carats. Carat is also right-skewed and not normally distributed.
Next, we plot price on the x-axis and carat on the y-axis in a scatterplot to identify the relationship between the two variables. Starting from the basic scatterplot, we then add the other variables one at a time as the point color.
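A common way to look at right-skewed data is to compare the mean with the median and to redraw the histogram on a log scale. The sketch below illustrates this on simulated values only: `price_sim` is drawn from a log-normal distribution with arbitrary parameters and is not the diamond data.

```r
# Simulated right-skewed values (NOT the diamond prices)
set.seed(1)
price_sim <- rlnorm(1000, meanlog = 8, sdlog = 1)

# In a right-skewed sample the mean is pulled above the median
skewed <- mean(price_sim) > median(price_sim)
print(skewed)

# Histogram on the original scale, then on a log10 scale
hist(price_sim, main = "Simulated price", xlab = "Price")
hist(log10(price_sim), main = "Simulated price (log10 scale)",
     xlab = "log10(Price)")
```

The same two checks could be applied to diamonds$price and diamonds$carat to see whether a log scale makes their distributions easier to read.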
ggplot(diamonds, aes(x = price, y = carat)) + geom_point()
ggplot(diamonds, aes(x = price, y = carat)) + geom_point(aes(color = color))
ggplot(diamonds, aes(x = price, y = carat)) + geom_point(aes(color = cut))
ggplot(diamonds, aes(x = price, y = carat)) + geom_point(aes(color = depth))
ggplot(diamonds, aes(x = price, y = carat)) + geom_point(aes(color = clarity))
ggplot(diamonds, aes(x = price, y = carat)) + geom_line()
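With more than 50,000 points, the scatterplots above overplot heavily. One possible variation, sketched here, puts carat on the x-axis as the explanatory variable, makes the points semi-transparent, and draws one panel per cut. It uses the `diamonds` data set that ships with ggplot2 (the object the masking message above refers to); that this copy matches the CSV used in the report is an assumption, and the alpha value of 0.1 is arbitrary.

```r
library(ggplot2)

# Data set bundled with ggplot2; assumed (not verified) to match the CSV
d <- ggplot2::diamonds

# Carat on x, price on y, transparent points, one panel per cut
p <- ggplot(d, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  facet_wrap(~ cut) +
  labs(x = "Carat", y = "Price")

print(p)
```

Replacing `~ cut` with `~ color` or `~ clarity` gives the corresponding panels for the other categorical variables.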
Conclusion
The relationship between price and carat was examined. With no third variable, price rises as carat increases, but beyond a certain point the increase levels off, which suggests that other factors affect the relationship. As discussed above, both price and carat are right-skewed, which may contribute to the unclear result. Carat and price were then examined with diamond color as a third variable. Price and carat still appear related, but the relationship again weakens beyond a certain point, possibly because of other influencing factors. Adding clarity, depth and cut in turn produced no significant change.
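The visual impression that the relationship weakens at higher carat values could be checked numerically with correlation coefficients. This is a rough sketch rather than part of the original analysis: it uses the `diamonds` data set bundled with ggplot2 (assumed to match the CSV), and the 2-carat split point is an assumption taken from the histogram discussion.

```r
# Data set bundled with ggplot2; assumed (not verified) to match the CSV
d <- ggplot2::diamonds

# Overall linear (Pearson) and rank-based (Spearman) correlation
r_all      <- cor(d$carat, d$price)
r_spearman <- cor(d$carat, d$price, method = "spearman")

# Correlation below and above an assumed split point of 2 carats
low    <- d[d$carat < 2, ]
high   <- d[d$carat >= 2, ]
r_low  <- cor(low$carat, low$price)
r_high <- cor(high$carat, high$price)

print(c(all = r_all, spearman = r_spearman, low = r_low, high = r_high))
```

A noticeably smaller coefficient in the upper group would support the reading that the relationship weakens for large diamonds.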