Data Description and Source

Diamond is one of the hardest substances on Earth which makes it very costly. In 2008, the Diamond Search Engine has managed to collect data for the charactheristics and price of more than 50 thousands diamonds. The dataset used was obtained from https://www.kaggle.com/shivam2503/diamonds.

Goals

For this phase of the project, we will try to understand the data deeply by examining the relationships of the different factors. This analysis is the stage which will lead to a multinomial regression model which will see the effect of the cut of a diamond based on its different features. From general knowledge, we know that factors such as the diamond’s carat, the quality of the cut, the clarity and colour are factors which determine the price of a diamond. but we need to see how the categorical variables affect the dataset.

Dataset description

The exact amount of diamonds recorded in this dataset was 53940 under 10 different features. These are: * carat(numeric); the weight of the diamond * cut(ordinal); quality of the diamond cut(order:Fair, Good, Very Good, Premium, Ideal) * color(ordinal); the colour of the diamond rating from J(being worst) to D(being best) * clarity(ordinal); a measurement of how clear the diamond is (I1 (worst) to IF(best)) * depth(numeric); total depth percentage = z / mean(x, y) = 2 * z / (x + y) * table(numeric); width of top of diamond * price(numeric); in US dollars * x(numeric); length(mm) * y(numeric); width(mm) * z(numeric); depth(mm)


Data Cleaning & Pre-processing

Loading packages to be used in this project

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(corrplot)
## corrplot 0.84 loaded
library(RColorBrewer)

Loading Dataset

setwd("C:/Users/smj_l/Downloads/")
diamonds<- read_csv("diamonds.csv.zip")
## Warning: Missing column names filled in: 'X1' [1]
head(diamonds)
## # A tibble: 6 x 11
##      X1 carat cut       color clarity depth table price     x     y     z
##   <dbl> <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2     2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3     3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4     4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5     5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6     6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Firstly, we will verify if there is any missing values in the dataset and we can see that the answer is 0 but we still have to go through more processes before the data is clean.

colSums(is.na(diamonds))
##      X1   carat     cut   color clarity   depth   table   price       x       y 
##       0       0       0       0       0       0       0       0       0       0 
##       z 
##       0
str(diamonds)
## tibble [53,940 x 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ X1     : num [1:53940] 1 2 3 4 5 6 7 8 9 10 ...
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : chr [1:53940] "Ideal" "Premium" "Good" "Premium" ...
##  $ color  : chr [1:53940] "E" "E" "E" "I" ...
##  $ clarity: chr [1:53940] "SI2" "SI1" "VS1" "VS2" ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : num [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X1 = col_double(),
##   ..   carat = col_double(),
##   ..   cut = col_character(),
##   ..   color = col_character(),
##   ..   clarity = col_character(),
##   ..   depth = col_double(),
##   ..   table = col_double(),
##   ..   price = col_double(),
##   ..   x = col_double(),
##   ..   y = col_double(),
##   ..   z = col_double()
##   .. )

The preliminary analysis we can see the presence of an “ID” variable name as X1 in the loaded dataset. This variable will be dropped.

Also, the 3 categorical variables in the dataset are currently “characters” and will be converted to factors with the respective order.

drop <- c("X1")
diamonds <- diamonds[,!(names(diamonds) %in% drop)]

diamonds$cut <- factor(diamonds$cut,
                       levels = c("Fair","Good","Very Good","Premium","Ideal"))
diamonds$color <- factor(diamonds$color,
                         levels = c("D","E","F","G","H","I","J"))
diamonds$clarity <- factor(diamonds$clarity,
                           levels =c("FL","IF","VVS1","VVS2","VS1","VS2","SI1","SI2","I1","I2","I3")) 


sapply(diamonds, class)
##     carat       cut     color   clarity     depth     table     price         x 
## "numeric"  "factor"  "factor"  "factor" "numeric" "numeric" "numeric" "numeric" 
##         y         z 
## "numeric" "numeric"

A diamond have either length or width or even depth of value 0 is not possible. Therefore, these should be threated as missing values and will be ommitted.

x.table <- table(diamonds$x)
x.table[names(x.table)==0]
## 0 
## 8
y.table <- table(diamonds$y)
y.table[names(y.table)==0]
## 0 
## 7
z.table <- table(diamonds$z)
z.table[names(z.table)==0]
##  0 
## 20
diamonds$x[diamonds$x==0]<-NA
diamonds$y[diamonds$y==0]<-NA
diamonds$z[diamonds$z==0]<-NA

diamonds_N <- na.omit(diamonds)
dim(diamonds_N)
## [1] 53920    10

Before proceeding to the visualisation part of this project, we will need to reduce our dataset. A random sample of 5000 observations will be subset from the previously clean dataset.

sub_diamond <- diamonds_N[sample(1:nrow(diamonds_N),5000,replace = FALSE),]

Data Visualisation

Now that our dataset is clean, lets take a look at the different variables.

We will start with the coutn for categorical data

table(sub_diamond$cut) %>% knitr::kable(caption = "Quality of cut ")
Quality of cut
Var1 Freq
Fair 138
Good 425
Very Good 1155
Premium 1257
Ideal 2025
table(sub_diamond$color) %>% knitr::kable(caption = "Color of diamond")
Color of diamond
Var1 Freq
D 645
E 874
F 932
G 1026
H 784
I 494
J 245
table(sub_diamond$clarity) %>% knitr::kable(caption = "Clarity of diamond")
Clarity of diamond
Var1 Freq
FL 0
IF 156
VVS1 329
VVS2 506
VS1 739
VS2 1151
SI1 1201
SI2 844
I1 74
I2 0
I3 0

We can see that for “clarity”, the best which is FL and the 2 worst( I2 & I3) are not present in the dataset being used.

Now let’s take a look at the distribution of the two of most important variables being assessed.

ggplot(sub_diamond,aes(x=carat)) +
  geom_histogram(binwidth = 0.10,color="black", fill= "#CCFFFF")+
  theme_minimal() + ggtitle("Distibution of carat")

ggplot(sub_diamond,aes(x=price)) +
  geom_histogram(binwidth = 500,color="black", fill= "#FFCCFF") + 
  ggtitle("Distribution of diamond prices") +
  xlab("Diamond Price in USD") + ylab("Frequency") +
  theme_light() 

For instance, we can clearly notice that we have more diamonds of carat less than 1 and this is confirmed with the summary table below which shows a mean of 0.79. And also we see that we have many diamonds having a price of less than $2500 and its median is found to be $2421.

For an easy understanding of the numerical values, a descriptive statistic is generated

mydata <- sub_diamond[, c(1,5:10)]
summary(mydata)
##      carat            depth           table           price        
##  Min.   :0.2000   Min.   :43.00   Min.   :50.00   Min.   :  335.0  
##  1st Qu.:0.4000   1st Qu.:61.10   1st Qu.:56.00   1st Qu.:  939.8  
##  Median :0.7100   Median :61.90   Median :57.00   Median : 2423.0  
##  Mean   :0.7955   Mean   :61.75   Mean   :57.43   Mean   : 3955.3  
##  3rd Qu.:1.0400   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5297.8  
##  Max.   :3.4000   Max.   :78.20   Max.   :79.00   Max.   :18803.0  
##        x               y               z        
##  Min.   :3.810   Min.   :3.780   Min.   :1.070  
##  1st Qu.:4.700   1st Qu.:4.710   1st Qu.:2.910  
##  Median :5.700   Median :5.720   Median :3.530  
##  Mean   :5.725   Mean   :5.728   Mean   :3.536  
##  3rd Qu.:6.530   3rd Qu.:6.520   3rd Qu.:4.030  
##  Max.   :9.420   Max.   :9.340   Max.   :6.270

Now lets take a look at the different relations between two variables.

ggplot(sub_diamond, aes(x=cut, y=carat, fill = cut)) +
  ggtitle("Diamonds cut by carat") + 
  geom_boxplot(outlier.colour="red", outlier.size=1) +
  theme_classic() + theme(axis.title.x=element_blank(),
                          axis.ticks.x=element_blank(),
                          axis.text.x=element_blank())

ggplot(sub_diamond, aes(x=cut, y=price, fill = cut)) +
  ggtitle("Diamonds cut by price ") + 
  geom_boxplot(outlier.colour="red", outlier.size=1) +
  theme_classic() + theme(axis.title.x=element_blank(),
                          axis.ticks.x=element_blank(),
                          axis.text.x=element_blank())

ggplot(sub_diamond, aes(x=color, y=price, fill = color)) +  
  ggtitle("Diamonds Color by price") +
  geom_boxplot(outlier.colour="red", outlier.size=1)+
  scale_fill_brewer()

ggplot(sub_diamond, aes(x=clarity, y=price, fill = clarity)) + 
  ggtitle("Diamonds Clarity by price") +
  geom_boxplot(outlier.colour="red", outlier.size=1)

ggplot(sub_diamond) +
  geom_bar(aes(x = cut,fill = clarity))+
  ggtitle("Cut quality stratified by clarity of diamonds")+
  scale_fill_brewer(palette = "Spectral")

Contrary to some beliefs, the diamond cut does not seem to have such a big effect on its price. But clarity and colour does make the differences when it comes to price and not in the way we would think of. In fact when we look at diamond clarity by price, the best clarity which is IF seems to have a lower interquartile price compare to I1(the worst clarity) but we got some outliers.

However, when we look at the count of different clarity under each cut quality, we do make the difference.

The next step is to look at the 3 variables at a time.

ggplot(sub_diamond) + geom_point(aes(x=carat,y=price,colour=cut)) +
  xlab("Carat") + ylab("Price in USD") + 
  ggtitle("Price Evolution as per diamonds' carat")+
  scale_color_brewer() + theme_dark()

ggplot(sub_diamond) + geom_point(aes(x=carat,y=price,colour=clarity)) +
  xlab("Carat") + ylab("Price in USD") + 
  ggtitle("Price Evolution as per diamonds' clarity")+
  scale_color_brewer(palette="Accent")

ggplot(sub_diamond) + geom_point(aes(x=carat,y=price,colour=color)) +
  xlab("Carat") + ylab("Price in USD") + 
  ggtitle("Price Evolution as per diamonds' colour")

The first plot is the comparison of carat against price but taking into consideration the cut of the diamond. An example to explain is by looking at the fair cut, which has trends to have a lower price when it comes to carat weight.

The second plot is almost the same as the first one exempt for the dot colour which represent the clarity of the diamond. Apart from some exceptionand taking into account the carat, best clarity gives a higher price and worst clarity is equivalent to a lower price.

The third plot seems quite a mess. Further explanation will be optain from the regression model.

Before ending this part of the project, lets take a look at the correlation of the other numerical variable.

resi <- cor(mydata)
round(resi,2)
##       carat depth table price     x     y    z
## carat  1.00  0.02  0.18  0.92  0.98  0.98 0.98
## depth  0.02  1.00 -0.30 -0.03 -0.03 -0.03 0.09
## table  0.18 -0.30  1.00  0.13  0.19  0.18 0.15
## price  0.92 -0.03  0.13  1.00  0.88  0.89 0.88
## x      0.98 -0.03  0.19  0.88  1.00  1.00 0.99
## y      0.98 -0.03  0.18  0.89  1.00  1.00 0.99
## z      0.98  0.09  0.15  0.88  0.99  0.99 1.00

We can definitely see how some variables are highly correlated to others. This will be more self explanatory in the Phase 2 of the project.