Diamond is one of the hardest substances on Earth which makes it very costly. In 2008, the Diamond Search Engine has managed to collect data for the charactheristics and price of more than 50 thousands diamonds. The dataset used was obtained from https://www.kaggle.com/shivam2503/diamonds.
For this phase of the project, we will try to understand the data deeply by examining the relationships of the different factors. This analysis is the stage which will lead to a multinomial regression model which will see the effect of the cut of a diamond based on its different features. From general knowledge, we know that factors such as the diamond’s carat, the quality of the cut, the clarity and colour are factors which determine the price of a diamond. but we need to see how the categorical variables affect the dataset.
The exact amount of diamonds recorded in this dataset was 53940 under 10 different features. These are: * carat(numeric); the weight of the diamond * cut(ordinal); quality of the diamond cut(order:Fair, Good, Very Good, Premium, Ideal) * color(ordinal); the colour of the diamond rating from J(being worst) to D(being best) * clarity(ordinal); a measurement of how clear the diamond is (I1 (worst) to IF(best)) * depth(numeric); total depth percentage = z / mean(x, y) = 2 * z / (x + y) * table(numeric); width of top of diamond * price(numeric); in US dollars * x(numeric); length(mm) * y(numeric); width(mm) * z(numeric); depth(mm)
Loading packages to be used in this project
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrplot)
## corrplot 0.84 loaded
library(RColorBrewer)
Loading Dataset
setwd("C:/Users/smj_l/Downloads/")
diamonds<- read_csv("diamonds.csv.zip")
## Warning: Missing column names filled in: 'X1' [1]
head(diamonds)
## # A tibble: 6 x 11
## X1 carat cut color clarity depth table price x y z
## <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Firstly, we will verify if there is any missing values in the dataset and we can see that the answer is 0 but we still have to go through more processes before the data is clean.
colSums(is.na(diamonds))
## X1 carat cut color clarity depth table price x y
## 0 0 0 0 0 0 0 0 0 0
## z
## 0
str(diamonds)
## tibble [53,940 x 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ X1 : num [1:53940] 1 2 3 4 5 6 7 8 9 10 ...
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : chr [1:53940] "Ideal" "Premium" "Good" "Premium" ...
## $ color : chr [1:53940] "E" "E" "E" "I" ...
## $ clarity: chr [1:53940] "SI2" "SI1" "VS1" "VS2" ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : num [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## - attr(*, "spec")=
## .. cols(
## .. X1 = col_double(),
## .. carat = col_double(),
## .. cut = col_character(),
## .. color = col_character(),
## .. clarity = col_character(),
## .. depth = col_double(),
## .. table = col_double(),
## .. price = col_double(),
## .. x = col_double(),
## .. y = col_double(),
## .. z = col_double()
## .. )
The preliminary analysis we can see the presence of an “ID” variable name as X1 in the loaded dataset. This variable will be dropped.
Also, the 3 categorical variables in the dataset are currently “characters” and will be converted to factors with the respective order.
drop <- c("X1")
diamonds <- diamonds[,!(names(diamonds) %in% drop)]
diamonds$cut <- factor(diamonds$cut,
levels = c("Fair","Good","Very Good","Premium","Ideal"))
diamonds$color <- factor(diamonds$color,
levels = c("D","E","F","G","H","I","J"))
diamonds$clarity <- factor(diamonds$clarity,
levels =c("FL","IF","VVS1","VVS2","VS1","VS2","SI1","SI2","I1","I2","I3"))
sapply(diamonds, class)
## carat cut color clarity depth table price x
## "numeric" "factor" "factor" "factor" "numeric" "numeric" "numeric" "numeric"
## y z
## "numeric" "numeric"
A diamond have either length or width or even depth of value 0 is not possible. Therefore, these should be threated as missing values and will be ommitted.
x.table <- table(diamonds$x)
x.table[names(x.table)==0]
## 0
## 8
y.table <- table(diamonds$y)
y.table[names(y.table)==0]
## 0
## 7
z.table <- table(diamonds$z)
z.table[names(z.table)==0]
## 0
## 20
diamonds$x[diamonds$x==0]<-NA
diamonds$y[diamonds$y==0]<-NA
diamonds$z[diamonds$z==0]<-NA
diamonds_N <- na.omit(diamonds)
dim(diamonds_N)
## [1] 53920 10
Before proceeding to the visualisation part of this project, we will need to reduce our dataset. A random sample of 5000 observations will be subset from the previously clean dataset.
sub_diamond <- diamonds_N[sample(1:nrow(diamonds_N),5000,replace = FALSE),]
Now that our dataset is clean, lets take a look at the different variables.
We will start with the coutn for categorical data
table(sub_diamond$cut) %>% knitr::kable(caption = "Quality of cut ")
Var1 | Freq |
---|---|
Fair | 138 |
Good | 425 |
Very Good | 1155 |
Premium | 1257 |
Ideal | 2025 |
table(sub_diamond$color) %>% knitr::kable(caption = "Color of diamond")
Var1 | Freq |
---|---|
D | 645 |
E | 874 |
F | 932 |
G | 1026 |
H | 784 |
I | 494 |
J | 245 |
table(sub_diamond$clarity) %>% knitr::kable(caption = "Clarity of diamond")
Var1 | Freq |
---|---|
FL | 0 |
IF | 156 |
VVS1 | 329 |
VVS2 | 506 |
VS1 | 739 |
VS2 | 1151 |
SI1 | 1201 |
SI2 | 844 |
I1 | 74 |
I2 | 0 |
I3 | 0 |
We can see that for “clarity”, the best which is FL and the 2 worst( I2 & I3) are not present in the dataset being used.
Now let’s take a look at the distribution of the two of most important variables being assessed.
ggplot(sub_diamond,aes(x=carat)) +
geom_histogram(binwidth = 0.10,color="black", fill= "#CCFFFF")+
theme_minimal() + ggtitle("Distibution of carat")
ggplot(sub_diamond,aes(x=price)) +
geom_histogram(binwidth = 500,color="black", fill= "#FFCCFF") +
ggtitle("Distribution of diamond prices") +
xlab("Diamond Price in USD") + ylab("Frequency") +
theme_light()
For instance, we can clearly notice that we have more diamonds of carat less than 1 and this is confirmed with the summary table below which shows a mean of 0.79. And also we see that we have many diamonds having a price of less than $2500 and its median is found to be $2421.
For an easy understanding of the numerical values, a descriptive statistic is generated
mydata <- sub_diamond[, c(1,5:10)]
summary(mydata)
## carat depth table price
## Min. :0.2000 Min. :43.00 Min. :50.00 Min. : 335.0
## 1st Qu.:0.4000 1st Qu.:61.10 1st Qu.:56.00 1st Qu.: 939.8
## Median :0.7100 Median :61.90 Median :57.00 Median : 2423.0
## Mean :0.7955 Mean :61.75 Mean :57.43 Mean : 3955.3
## 3rd Qu.:1.0400 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5297.8
## Max. :3.4000 Max. :78.20 Max. :79.00 Max. :18803.0
## x y z
## Min. :3.810 Min. :3.780 Min. :1.070
## 1st Qu.:4.700 1st Qu.:4.710 1st Qu.:2.910
## Median :5.700 Median :5.720 Median :3.530
## Mean :5.725 Mean :5.728 Mean :3.536
## 3rd Qu.:6.530 3rd Qu.:6.520 3rd Qu.:4.030
## Max. :9.420 Max. :9.340 Max. :6.270
Now lets take a look at the different relations between two variables.
ggplot(sub_diamond, aes(x=cut, y=carat, fill = cut)) +
ggtitle("Diamonds cut by carat") +
geom_boxplot(outlier.colour="red", outlier.size=1) +
theme_classic() + theme(axis.title.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.x=element_blank())
ggplot(sub_diamond, aes(x=cut, y=price, fill = cut)) +
ggtitle("Diamonds cut by price ") +
geom_boxplot(outlier.colour="red", outlier.size=1) +
theme_classic() + theme(axis.title.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.x=element_blank())
ggplot(sub_diamond, aes(x=color, y=price, fill = color)) +
ggtitle("Diamonds Color by price") +
geom_boxplot(outlier.colour="red", outlier.size=1)+
scale_fill_brewer()
ggplot(sub_diamond, aes(x=clarity, y=price, fill = clarity)) +
ggtitle("Diamonds Clarity by price") +
geom_boxplot(outlier.colour="red", outlier.size=1)
ggplot(sub_diamond) +
geom_bar(aes(x = cut,fill = clarity))+
ggtitle("Cut quality stratified by clarity of diamonds")+
scale_fill_brewer(palette = "Spectral")
Contrary to some beliefs, the diamond cut does not seem to have such a big effect on its price. But clarity and colour does make the differences when it comes to price and not in the way we would think of. In fact when we look at diamond clarity by price, the best clarity which is IF seems to have a lower interquartile price compare to I1(the worst clarity) but we got some outliers.
However, when we look at the count of different clarity under each cut quality, we do make the difference.
The next step is to look at the 3 variables at a time.
ggplot(sub_diamond) + geom_point(aes(x=carat,y=price,colour=cut)) +
xlab("Carat") + ylab("Price in USD") +
ggtitle("Price Evolution as per diamonds' carat")+
scale_color_brewer() + theme_dark()
ggplot(sub_diamond) + geom_point(aes(x=carat,y=price,colour=clarity)) +
xlab("Carat") + ylab("Price in USD") +
ggtitle("Price Evolution as per diamonds' clarity")+
scale_color_brewer(palette="Accent")
ggplot(sub_diamond) + geom_point(aes(x=carat,y=price,colour=color)) +
xlab("Carat") + ylab("Price in USD") +
ggtitle("Price Evolution as per diamonds' colour")
The first plot is the comparison of carat against price but taking into consideration the cut of the diamond. An example to explain is by looking at the fair cut, which has trends to have a lower price when it comes to carat weight.
The second plot is almost the same as the first one exempt for the dot colour which represent the clarity of the diamond. Apart from some exceptionand taking into account the carat, best clarity gives a higher price and worst clarity is equivalent to a lower price.
The third plot seems quite a mess. Further explanation will be optain from the regression model.
Before ending this part of the project, lets take a look at the correlation of the other numerical variable.
resi <- cor(mydata)
round(resi,2)
## carat depth table price x y z
## carat 1.00 0.02 0.18 0.92 0.98 0.98 0.98
## depth 0.02 1.00 -0.30 -0.03 -0.03 -0.03 0.09
## table 0.18 -0.30 1.00 0.13 0.19 0.18 0.15
## price 0.92 -0.03 0.13 1.00 0.88 0.89 0.88
## x 0.98 -0.03 0.19 0.88 1.00 1.00 0.99
## y 0.98 -0.03 0.18 0.89 1.00 1.00 0.99
## z 0.98 0.09 0.15 0.88 0.99 0.99 1.00
We can definitely see how some variables are highly correlated to others. This will be more self explanatory in the Phase 2 of the project.