Read Data from local source

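# stringsAsFactors = TRUE keeps cut, color and clarity as factors, as in the output below (the default is FALSE from R 4.0.0 onward)
diamonds <- read.table(file = "C:/Users/Layla Habibullah/Desktop/diamonds.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)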

Introduction

Diamonds are popular stones used for many purposes, especially jewelry. They fall into different categories, which makes some diamonds cheap and others very expensive. This final project uses a diamond data set to examine how price relates to other factors such as carat, color, cut, clarity and depth. The goal is to identify which factors drive price changes in this data set, which was downloaded from an online data source.

Research Questions

The aims of the project are:
1. To compute basic statistics for the diamond dataset (for instance the mean, median, quartiles, and number of rows and columns) and draw conclusions from them.
2. To wrangle the data into a more usable form (data aggregation, renaming columns and changing column classes as required).
3. To explore the data graphically to identify the relationship between price and carat with respect to color, cut and clarity.
4. To conclude the findings.

1. Summary of the Diamond dataset

head(diamonds)
##   X carat       cut color clarity depth table price    x    y    z
## 1 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
summary(diamonds)
##        X             carat               cut        color    
##  Min.   :    1   Min.   :0.2000   Fair     : 1610   D: 6775  
##  1st Qu.:13486   1st Qu.:0.4000   Good     : 4906   E: 9797  
##  Median :26971   Median :0.7000   Ideal    :21551   F: 9542  
##  Mean   :26971   Mean   :0.7979   Premium  :13791   G:11292  
##  3rd Qu.:40455   3rd Qu.:1.0400   Very Good:12082   H: 8304  
##  Max.   :53940   Max.   :5.0100                     I: 5422  
##                                                     J: 2808  
##     clarity          depth           table           price      
##  SI1    :13065   Min.   :43.00   Min.   :43.00   Min.   :  326  
##  VS2    :12258   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950  
##  SI2    : 9194   Median :61.80   Median :57.00   Median : 2401  
##  VS1    : 8171   Mean   :61.75   Mean   :57.46   Mean   : 3933  
##  VVS2   : 5066   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324  
##  VVS1   : 3655   Max.   :79.00   Max.   :95.00   Max.   :18823  
##  (Other): 2531                                                  
##        x                y                z         
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.700   Median : 5.710   Median : 3.530  
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :10.740   Max.   :58.900   Max.   :31.800  
## 
str(diamonds)
## 'data.frame':    53940 obs. of  11 variables:
##  $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Factor w/ 5 levels "Fair","Good",..: 3 4 2 4 2 5 5 5 1 5 ...
##  $ color  : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Factor w/ 8 levels "I1","IF","SI1",..: 4 3 5 6 4 8 7 3 6 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

The dataset contains 11 columns: X, which appears to be a serial number, followed by carat, cut, color, clarity, depth, table, price, x, y and z. It contains 53,940 observations. The mean price is $3,933 and the median price is $2,401, with a minimum of $326 and a maximum of $18,823. The mean carat is 0.7979 and the median is 0.7, with a minimum of 0.2 and a maximum of 5.01.
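The figures quoted above can also be pulled out one at a time rather than read off `summary()`. The sketch below assumes a small stand-in data frame, `d`, built from the six rows printed by `head(diamonds)`; with the full data set, `d` would simply be `diamonds`.

```r
# Stand-in data: the six rows shown by head(diamonds) above (assumption:
# the full data frame would be used in the real analysis).
d <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  price = c(326, 326, 327, 334, 335, 336)
)

dims <- dim(d)                      # number of rows, number of columns
price_mean <- mean(d$price)
price_median <- median(d$price)
price_range <- range(d$price)       # minimum and maximum
price_quartiles <- quantile(d$price, probs = c(0.25, 0.5, 0.75))

print(dims)
print(c(mean = price_mean, median = price_median))
print(price_range)
print(price_quartiles)
```

Reading the maximum from `range()` rather than from the printed table makes it harder to confuse it with another figure, such as the row count.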

Data Wrangling

aggregate(diamonds$price, by=list(diamonds$cut), FUN=mean, na.rm=TRUE)
##     Group.1        x
## 1      Fair 4358.758
## 2      Good 3928.864
## 3     Ideal 3457.542
## 4   Premium 4584.258
## 5 Very Good 3981.760
aggregate(diamonds$price, by=list(diamonds$color), FUN=mean, na.rm=TRUE)
##   Group.1        x
## 1       D 3169.954
## 2       E 3076.752
## 3       F 3724.886
## 4       G 3999.136
## 5       H 4486.669
## 6       I 5091.875
## 7       J 5323.818
aggregate(diamonds$price, by=list(diamonds$depth), FUN=mean, na.rm=TRUE)
##     Group.1        x
## 1      43.0 4206.000
## 2      44.0 4032.000
## 3      50.8 6727.000
## 4      51.0  945.000
## 5      52.2 1895.000
## 6      52.3 1166.000
## 7      52.7 1293.000
## 8      53.0 2856.000
## 9      53.1 2815.000
## 10     53.2 2988.500
## 11     53.3 2855.000
## 12     53.4 2164.000
## 13     53.8 4790.000
## 14     54.0 1012.000
## 15     54.2  905.500
## 16     54.3 1352.000
## 17     54.4 1013.000
## 18     54.6 1011.000
## 19     54.7 2691.000
## 20     55.0 2319.500
## 21     55.1 2393.333
## 22     55.2 3479.833
## 23     55.3 3336.000
## 24     55.4 2867.500
## 25     55.5 1946.000
## 26     55.6 4800.000
## 27     55.8 3197.571
## 28     55.9 3995.778
## 29     56.0 3432.500
## 30     56.1 4501.333
## 31     56.2 5561.273
## 32     56.3 3049.529
## 33     56.4 2681.000
## 34     56.5 3858.231
## 35     56.6 2872.400
## 36     56.7 4919.000
## 37     56.8 4886.875
## 38     56.9 4645.346
## 39     57.0 4191.632
## 40     57.1 3394.542
## 41     57.2 3978.111
## 42     57.3 4230.600
## 43     57.4 3645.162
## 44     57.5 3481.739
## 45     57.6 5288.263
## 46     57.7 3879.082
## 47     57.8 4788.020
## 48     57.9 4986.679
## 49     58.0 4678.022
## 50     58.1 4500.777
## 51     58.2 5562.035
## 52     58.3 4261.887
## 53     58.4 4770.598
## 54     58.5 5218.732
## 55     58.6 5172.447
## 56     58.7 4909.576
## 57     58.8 4348.134
## 58     58.9 5036.162
## 59     59.0 4599.811
## 60     59.1 4367.402
## 61     59.2 4240.377
## 62     59.3 4772.441
## 63     59.4 4711.286
## 64     59.5 4284.459
## 65     59.6 4633.917
## 66     59.7 4075.138
## 67     59.8 4715.109
## 68     59.9 4489.675
## 69     60.0 3637.217
## 70     60.1 4548.991
## 71     60.2 4253.793
## 72     60.3 4317.510
## 73     60.4 3819.040
## 74     60.5 4255.520
## 75     60.6 4107.379
## 76     60.7 4058.586
## 77     60.8 3874.561
## 78     60.9 3566.256
## 79     61.0 3555.519
## 80     61.1 3843.393
## 81     61.2 3652.300
## 82     61.3 3564.140
## 83     61.4 3513.588
## 84     61.5 3751.722
## 85     61.6 3472.563
## 86     61.7 3527.523
## 87     61.8 3551.376
## 88     61.9 3498.736
## 89     62.0 3825.946
## 90     62.1 3571.067
## 91     62.2 3993.337
## 92     62.3 3894.251
## 93     62.4 4122.480
## 94     62.5 4096.372
## 95     62.6 4230.136
## 96     62.7 4303.221
## 97     62.8 4405.115
## 98     62.9 4130.585
## 99     63.0 4217.516
## 100    63.1 3872.365
## 101    63.2 3747.531
## 102    63.3 3857.152
## 103    63.4 3618.174
## 104    63.5 3849.479
## 105    63.6 4212.819
## 106    63.7 3830.281
## 107    63.8 4121.166
## 108    63.9 3939.952
## 109    64.0 3983.671
## 110    64.1 4458.853
## 111    64.2 4363.552
## 112    64.3 4080.646
## 113    64.4 4050.069
## 114    64.5 4341.745
## 115    64.6 4583.895
## 116    64.7 4377.384
## 117    64.8 5322.203
## 118    64.9 4670.864
## 119    65.0 5097.288
## 120    65.1 4265.119
## 121    65.2 4414.519
## 122    65.3 3870.293
## 123    65.4 5714.875
## 124    65.5 4164.333
## 125    65.6 4221.757
## 126    65.7 3555.692
## 127    65.8 5203.787
## 128    65.9 5419.250
## 129    66.0 4149.793
## 130    66.1 3900.379
## 131    66.2 3435.235
## 132    66.3 4740.040
## 133    66.4 3544.765
## 134    66.5 3757.087
## 135    66.6 4741.375
## 136    66.7 4427.786
## 137    66.8 4147.476
## 138    66.9 4484.435
## 139    67.0 4257.417
## 140    67.1 4449.800
## 141    67.2 5493.500
## 142    67.3 3038.308
## 143    67.4 5274.625
## 144    67.5 7276.200
## 145    67.6 4656.083
## 146    67.7 6257.333
## 147    67.8 3180.750
## 148    67.9 4343.750
## 149    68.0 4910.167
## 150    68.1 4024.667
## 151    68.2 3804.400
## 152    68.3 3563.167
## 153    68.4 2518.800
## 154    68.5 3800.400
## 155    68.6 2340.714
## 156    68.7 4816.250
## 157    68.8 4428.000
## 158    68.9 4048.000
## 159    69.0 4113.333
## 160    69.1 8736.000
## 161    69.2 1739.000
## 162    69.3 2523.000
## 163    69.4 5405.000
## 164    69.5 3864.000
## 165    69.6 5711.000
## 166    69.7 3590.750
## 167    69.8 3644.333
## 168    69.9 2117.000
## 169    70.0 5083.000
## 170    70.1 5446.000
## 171    70.2 7997.333
## 172    70.5 6860.000
## 173    70.6 8266.667
## 174    70.8 1020.500
## 175    71.0  613.000
## 176    71.2 1274.000
## 177    71.3 4368.000
## 178    71.6 2644.500
## 179    71.8 4455.000
## 180    72.2 2438.000
## 181    72.9 2691.000
## 182    73.6 1789.000
## 183    78.2 1262.000
## 184    79.0 2579.000
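Because depth is continuous, grouping on its raw values gives one row per distinct depth (184 groups above), and some of those groups appear to contain very few diamonds. A possible alternative, sketched here on a stand-in data frame `d` taken from the six rows printed by `head(diamonds)` and with bin edges chosen arbitrarily for illustration, is to bin depth with `cut()` before aggregating:

```r
d <- data.frame(
  depth = c(61.5, 59.8, 56.9, 62.4, 63.3, 62.8),
  price = c(326, 326, 327, 334, 335, 336)
)

# Bin edges are an assumption chosen for illustration only.
d$depth_bin <- cut(d$depth, breaks = c(40, 60, 62, 64, 80))

# One mean price per depth bin; bins with no rows are dropped.
depth_means <- aggregate(price ~ depth_bin, data = d, FUN = mean)
print(depth_means)
```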

Data aggregation was used to compute the mean price of diamonds by cut, color and depth. The table shows that mean price varies with cut, but not in the order one would expect: the mean price of a Fair cut diamond is 4358.758, while that of a Very Good cut diamond is 3981.76. Mean price likewise differs across colors and depths. A likely explanation is that other factors, either absent from the data set or present in it (such as clarity or table), have a significant effect on price.
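One factor that is in the data set and could be checked directly is carat: if the cut categories differ in average size, that alone might explain the unexpected ordering of mean prices. A minimal sketch, on a stand-in data frame `d` built from the six rows printed by `head(diamonds)` (so the numbers it produces say nothing about the full data), aggregates price and carat together by cut:

```r
d <- data.frame(
  cut   = c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good"),
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  price = c(326, 326, 327, 334, 335, 336)
)

# The formula interface keeps the column names instead of Group.1 / x.
by_cut <- aggregate(cbind(price, carat) ~ cut, data = d, FUN = mean)
print(by_cut)
```

On the full data, a cut category with a high mean price and a high mean carat would point to size, not cut quality, as the driver.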

The first column in the dataset has no meaningful name (R labelled it ‘X’ on import), so for clarity, and as part of the data wrangling, it is renamed ‘S.No’.

Furthermore, we will check the types of price and cut to ensure they have the classes required for the analysis.

colnames(diamonds)[1] <- 'S.No'
head(diamonds)
##   S.No carat       cut color clarity depth table price    x    y    z
## 1    1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2    2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3    3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4    4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5    5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6    6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
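The type check mentioned above can be sketched as follows. The stand-in data frame `d` and the choice to store cut as an ordered factor are assumptions for illustration; the level order Fair < Good < Very Good < Premium < Ideal is the conventional quality ranking for cut grades.

```r
d <- data.frame(
  X     = 1:6,
  cut   = c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good"),
  price = c(326L, 326L, 327L, 334L, 335L, 336L)
)

# Rename by name rather than by position, so a reordered file cannot
# silently rename the wrong column.
names(d)[names(d) == "X"] <- "S.No"

# Inspect the current classes.
print(sapply(d, class))

# Store cut as an ordered factor so that comparisons and plot legends
# follow quality order instead of alphabetical order.
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
d$cut <- factor(d$cut, levels = cut_levels, ordered = TRUE)

print(levels(d$cut))
print(d$cut[1] > d$cut[3])  # Ideal compared with Good
```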

Data Exploration through graphical representation, to compare price with carat with reference to cut, color and depth

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked _by_ '.GlobalEnv':
## 
##     diamonds

Now we will examine the distributions of price and carat using histograms.

hist(diamonds$price, main="Price Histogram", xlab='Price')

hist(diamonds$carat, main="Carat Histogram", xlab= 'Carat')

We plotted histograms of the price and carat columns to evaluate the shape of the data. The price histogram is clearly right skewed, with most prices falling on the left-hand side. The data is therefore not normally distributed: most prices in the data set are below roughly 4000, with a long tail of higher prices. This limits how clearly the analysis can relate price to its determinants, but we evaluate the data as given. Similarly, the carat histogram shows that most diamonds are smaller than 2 carats, so the data cannot clearly establish the relationship for diamonds above 3 carats. Again, the data is right skewed and not normal.
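The visual impression of right skew can be backed by a number. One simple option, shown here as a sketch and not as the method used in this project, is the quartile (Bowley) skewness, (Q3 + Q1 - 2*Q2) / (Q3 - Q1), which is positive for a right-skewed distribution. The helper name `quartile_skew` is hypothetical; the inputs are the quartiles reported by `summary(diamonds)` above.

```r
# Quartile (Bowley) skewness: 0 for a symmetric distribution, positive
# when the upper quartile lies further from the median than the lower one.
quartile_skew <- function(q1, q2, q3) {
  (q3 + q1 - 2 * q2) / (q3 - q1)
}

# Quartiles taken from the summary() output earlier in this report.
price_skew <- quartile_skew(q1 = 950, q2 = 2401, q3 = 5324)
carat_skew <- quartile_skew(q1 = 0.40, q2 = 0.70, q3 = 1.04)

print(c(price = price_skew, carat = carat_skew))
```

Both values come out positive, consistent with the right skew seen in the histograms. With the full data available, `quantile(diamonds$price, c(0.25, 0.5, 0.75))` would supply the same kind of inputs.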

Next, we plot price on the x axis and carat on the y axis in a scatterplot to identify the relationship between the two variables. After looking at the basic scatterplot, we add the remaining variables one at a time as a color aesthetic.

ggplot(diamonds, aes(x = price, y = carat)) + geom_point()

ggplot(diamonds, aes(x = price, y = carat)) + geom_point(aes(color = color))

ggplot(diamonds, aes(x = price, y = carat)) + geom_point(aes(color = cut))

ggplot(diamonds, aes(x = price, y = carat)) + geom_point(aes(color = depth))

ggplot(diamonds, aes(x = price, y = carat)) + geom_point(aes(color = clarity))

ggplot(diamonds, aes(x = price, y = carat)) + geom_line()
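With tens of thousands of points, the scatterplots above overplot heavily, and the right skew of both variables pushes most points into one corner. A variant worth trying, offered as a sketch and not as the plot used in this report, puts carat on the x axis (treating price as the response), makes the points semi-transparent and uses log scales on both axes. The stand-in data frame `d` below holds only the six rows printed by `head(diamonds)`; in practice it would be the full `diamonds` data frame.

```r
library(ggplot2)

d <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  price = c(326, 326, 327, 334, 335, 336),
  color = c("E", "E", "E", "I", "J", "J")
)

p <- ggplot(d, aes(x = carat, y = price, color = color)) +
  geom_point(alpha = 0.3) +   # transparency reduces overplotting
  scale_x_log10() +           # log scales spread out the skewed values
  scale_y_log10() +
  labs(x = "Carat", y = "Price", color = "Color")

print(p)
```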

Conclusion

Price and carat were examined to identify their relationship. First, they were examined without any third factor: price initially increases as carat increases, but beyond a certain point the growth slows, which suggests that other factors may affect the relationship. As discussed earlier, both price and carat are right skewed, which may contribute to the unclear results. Next, carat and price were examined with diamond color added as a third variable. Price and carat still appear related, but the relationship weakens at some point, possibly because other factors are influencing it. The relationship was also examined with clarity, depth and cut added, but no significant change was seen.
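The strength of the relationship summarized above could also be put into a single number. The sketch below assumes a six-row stand-in data frame `d` taken from the rows printed by `head(diamonds)`, so its result is not a finding about the full data. Spearman's rank correlation is one option that does not assume normality, which suits the skewed price and carat values:

```r
d <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  price = c(326, 326, 327, 334, 335, 336)
)

# Rank-based correlation: it depends only on the ordering of the values,
# so the skew noted in the histograms does not distort it.
rho <- cor(d$carat, d$price, method = "spearman")
print(rho)
```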