Project - Mispriced Diamonds

Description of the project

In the given dataset, each row represents a diamond, with three columns: carat (weight), clarity, and price (the price at which the diamond was sold).

More than 50,000 transactions are recorded in the given file. Is a higher-clarity diamond always priced higher? Does the relationship between price and clarity always hold true?

Let us investigate.

Step 1: Load the data

library(ggplot2)
# Choose the CSV file interactively and read it into a data frame
mydata <- read.csv(file.choose())

View(mydata)
str(mydata)
## 'data.frame':    53940 obs. of  3 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ clarity: chr  "SI2" "SI1" "VS1" "VS2" ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
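
The str() output shows clarity stored as character, so ggplot2 would order the legend alphabetically rather than by grade. A minimal sketch of converting the column to an ordered factor follows, assuming the file uses the standard eight clarity grades from I1 (lowest) to IF (highest). Here `demo` is a stand-in for mydata built from the first four rows shown above, and `clarity_levels` is a name introduced only for illustration:

```r
# Stand-in for mydata, using the first four rows from the str() output.
demo <- data.frame(
  carat   = c(0.23, 0.21, 0.23, 0.29),
  clarity = c("SI2", "SI1", "VS1", "VS2"),
  price   = c(326L, 326L, 327L, 334L),
  stringsAsFactors = FALSE
)

# Clarity grades from lowest to highest (assumed standard grading order).
clarity_levels <- c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")

demo$clarity <- factor(demo$clarity, levels = clarity_levels, ordered = TRUE)

# Count rows per grade; grades absent from the sample show a count of 0.
table(demo$clarity)

# Any value outside clarity_levels would have become NA; check for that.
sum(is.na(demo$clarity))
```

Applied to mydata, the same conversion would make the color legend in the later plots run from the lowest to the highest grade.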

Step 2: Ensure you have the right package for the analysis: ggplot2
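
One way to check for the package before loading it is sketched below with base R's requireNamespace() and install.packages(). Installing from CRAN assumes network access and a configured repository:

```r
# Install ggplot2 only if it is not already available, then load it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
```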

Step 3: Scatter plot

ggplot(data = mydata, aes(x = carat, y = price)) + geom_point()

This gives the scatter plot for the diamonds. All the transactions are captured, but the clarity variable is not yet shown.

Step 4: Now we can add the ‘clarity’ variable as color

ggplot(data = mydata, aes(x = carat, y = price, color = clarity)) + geom_point()

Step 5: The scatter plot has many overlapping points and is not very clear

ggplot(data = mydata, aes(x = carat, y = price, color = clarity)) + geom_point(alpha = 0.1)

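The alpha argument sets point transparency (0.1 makes each point 10% opaque), so densely overlapping regions become easier to read.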
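
To get a feel for what alpha = 0.1 does where points pile up, the arithmetic can be sketched as follows. It assumes the graphics device uses standard alpha compositing, where n stacked points of the same color reach an opacity of 1 - (1 - alpha)^n; `overlap_opacity` is a hypothetical helper, not a ggplot2 function:

```r
# Opacity of n stacked points, each drawn with transparency `alpha`,
# under standard alpha compositing: 1 - (1 - alpha)^n.
overlap_opacity <- function(n, alpha = 0.1) {
  1 - (1 - alpha)^n
}

# Opacity for 1, 5, 10 and 50 overlapping points at alpha = 0.1.
round(overlap_opacity(c(1, 5, 10, 50)), 3)
```

Under this assumption a single point is 10% opaque and ten overlapping points reach roughly 65%, which is why dense areas look dark and isolated points stay faint.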

Step 6: The plot is clearer now, but it still contains every record, including the sparse points at very high carat values.

Therefore, let us filter the data with a condition on carat.

ggplot(data = mydata[mydata$carat < 2.5, ], aes(x = carat, y = price, color = clarity)) + geom_point(alpha = 0.1)

Only records with a ‘carat’ value below 2.5 are included in the visualization.
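
The filter in this step is plain base R row indexing. The small sketch below uses made-up rows (`demo` and its values are hypothetical, not taken from the file) to show what the condition keeps, and how subset() differs when a carat value is missing:

```r
# Hypothetical rows for illustration only.
demo <- data.frame(
  carat   = c(0.3, 1.2, 2.4, 2.5, 3.1, NA),
  clarity = c("SI1", "VS2", "IF", "I1", "I1", "VS1"),
  price   = c(400, 6000, 15000, 9000, 12000, 500)
)

# Logical indexing: TRUE rows are kept, FALSE rows are dropped.
# A row whose carat is NA comes back as an all-NA row.
kept_index <- demo[demo$carat < 2.5, ]

# subset() applies the same condition but drops rows where it is NA.
kept_subset <- subset(demo, carat < 2.5)

nrow(kept_index)   # counts the NA row as well
nrow(kept_subset)  # counts only rows with carat below 2.5

# Share of non-missing rows that the filter removes.
mean(demo$carat >= 2.5, na.rm = TRUE)
```

If the carat column has no missing values, both forms give the same result.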

Step 7: To see the average price trend across ‘carat’ for each clarity grade, let us add a smoothing line

ggplot(mydata[mydata$carat < 2.5, ], aes(x = carat, y = price, color = clarity)) + geom_point(alpha = 0.1) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Because the number of observations is large, geom_smooth() automatically chooses the 'gam' smoothing method, as the message above reports.
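
The message can be avoided by naming the smoothing method and formula explicitly. The sketch below assumes the ggplot2 and mgcv packages are installed. It uses ggplot2's built-in `diamonds` data as a stand-in for mydata; that data set also has 53,940 rows, but whether the CSV matches it exactly is an assumption. `plot_data` and `p` are names introduced for illustration:

```r
library(ggplot2)

# Stand-in for mydata: the same three columns from ggplot2's diamonds data.
plot_data <- as.data.frame(diamonds[, c("carat", "clarity", "price")])
plot_data <- plot_data[plot_data$carat < 2.5, ]

# Same plot as Step 7, with the smoother spelled out so no message is printed.
p <- ggplot(plot_data, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))

p
```

Spelling out the method also keeps the plot reproducible if the automatic choice would differ on a smaller subset of the data.
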
Conclusion
1. In the 1.0 to 1.5 carat range, there is little crossing or intersection between the trend lines of the different clarity grades.
2. In the 1.5 to 2.0 carat range, there are many crossings between clarity grades: VVS2 intersects VVS1 and also the IF price range.
3. In the 2.0 to 2.5 carat range, pricing does not follow clarity: VS2 intersects many other clarity grades, such as SI1, SI2, and VVS1, and then trends downward; IF is also intersected by I1.

Our assumption was that price is determined by the clarity of the diamond. However, the many intersections between the clarity grades show that prices are influenced by other factors as well, which contradicts the assumption that price is based on clarity alone.
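
The visual reading could be cross-checked numerically by averaging price per clarity grade within each carat range. The following is a minimal base R sketch: `mean_price_by_range` is a hypothetical helper, and `demo` holds made-up rows purely to show the shape of the result, not findings from the file:

```r
mean_price_by_range <- function(df, breaks = c(1.0, 1.5, 2.0, 2.5)) {
  # Assign each diamond to a carat range; values outside the breaks become NA.
  df$carat_range <- cut(df$carat, breaks = breaks, right = FALSE)
  df <- df[!is.na(df$carat_range), ]
  # Mean price for every (carat range, clarity) combination present.
  aggregate(price ~ carat_range + clarity, data = df, FUN = mean)
}

# Hypothetical rows for illustration only.
demo <- data.frame(
  carat   = c(1.1, 1.2, 1.7, 1.8, 2.2, 2.3, 0.5),
  clarity = c("IF", "IF", "VS2", "VS2", "I1", "I1", "SI1"),
  price   = c(9000, 11000, 14000, 16000, 8000, 10000, 1500)
)

result <- mean_price_by_range(demo)
result
```

Run on the real data, any carat range where a lower grade shows a higher mean price than a higher grade would correspond to the crossings seen in the plot.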