The Diamond Challenge

The Diamond Industry

The Dataset

The wholesale diamond dataset includes numerical data and categorical data. There are 407280 total diamonds in the dataset covering sales from 2010 to 2021. For each diamond sale we are given 11 attributes:

  • carat: weight of the diamond
  • cut: rating system quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  • color: standardized color code from J (worst) to D (best)
  • clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  • depth: percentage (0 to 100) relating the diamond's depth (top to bottom) to its width
  • table: percentage (0 to 100) relating the diamond's overall width to the width of the top part
  • price: what the diamond sold for in US dollars
  • x: length in millimeters
  • y: width in millimeters
  • z: height in millimeters
  • year: year of the sale

The Task

  • Data Cleaning: Check the data for correctness. Remove any data entries that appear problematic. Summarize the data that was corrupted and the problems encountered. An example of a problematic data entry would be a negative sale price. These kinds of data quality problems are commonly found in real-world datasets.

  • Exploratory analysis and summary statistics: Create plots, graphs, and other visuals to summarize interesting aspects of the dataset. This will give you and your company insights into the dataset.

  • Building models (price prediction): The main goal of your work is to build a model to predict the price of diamond sales in 2022. That is, your model should take as input the attributes (carat, cut, color, clarity, depth, table, x, y, z, and year) and predict the price (either as a single number or a probabilistic range of values). Train a machine-learning algorithm to estimate the price of diamonds based on these attributes.

  • Using your price prediction models: Congrats on building a sophisticated machine learning model to predict prices! Now it's time to deploy it.

View Python Code

Data Cleaning

Cleaning the Data

We dropped the rows with missing (null) values in the carat column.

We noticed several negative values in the cost (dollars) column. We decided to convert these to positive values, reasoning that the negative sign was likely a human error made when entering the data.

Outliers discovered in the length (mm), width (mm), height (mm), and cost (dollars) columns were also dropped. On closer inspection, the height (mm) column contained some corrupted values that are not physically possible for a diamond; these rows were dropped as well.

Our final Cleaned Dataset

Exploratory Analysis

Exploratory Analysis

One of the most important things to do with a large data set is to plot every point. This gives us insight into how best to visualize the data. The plot below shows every data point with respect to the following variables: year, carat, and price.

Figure 1: Relationship between Carat and Price with regards to Year

This graph shows that the data set is dominated by points from 2021, which cover up points from other years that may lie underneath them. To get a clearer view of the different categories in the data set, it is therefore better to plot a random sample of the data. Figure 1 is useful not only for deciding how to visualize the data but also for exposing flaws that the data cleaning process must address: it displays price values below 0. A negative price is not meaningful, so these entries must be handled during data cleaning. To visualize the data more clearly, plots were created from a random sample of 5,000 observations.

Figure 2: Distribution of Prices

Figure 2 shows the distribution of prices, which is right-skewed. Most of the diamonds in our data set fall within a particular price range: most are priced below $5,000. A smaller number of diamonds fall on the more expensive end of the spectrum. These high prices are the reason for the skew, because they pull the mean to the right and make the distribution positively skewed. This plot raises the question of which specific characteristics affect the value of a diamond.

Figure 3: Relationship between Carat and Price with regards to Color

This visualization shows the relationship between carat and price for each color group. We were interested to see whether color is a major factor in the price of a diamond. According to Figure 3, the distribution of diamonds within each color group is approximately similar. In other words, there appears to be a positive correlation between a diamond's carat and its price within every color group. The plot suggests that carat influences the price of a diamond, while color may not be an impactful factor, since no color group displays a markedly different distribution from the others. However, further analysis is needed before ruling out color as a significant factor in the price of a diamond.

Figure 4: Relationship between Depth and Price with regards to Year

Another factor of interest was how the depth of a diamond relates to its price. We looked at the relationship between depth and price for each year. No unusual patterns distinguish one year from another; each plot shows a similar pattern. The plots also show no sign of a linear relationship. In fact, the data points appear uniformly distributed. Since each plot shows a similar pattern, there is reason to assume that the price of a diamond is affected by a specific depth range. However, we cannot say that certain depth measurements lead to high prices, because diamonds with the same depth sell at very different prices. This plot therefore suggests that depth may not be a major influence on the price of diamonds.

Figure 5: Relationship between Depth and Price with regards to Cut

The relationship between price and depth with respect to cut was also explored. Including cut reveals a difference in the relationship between depth and price. As the plot shows, there is more variance among observations when the cut is Fair than for the other cuts. The Good category also shows more variance, but not as much as the Fair category. These differences suggest that cut may be a variable that affects the price of diamonds. However, we cannot be certain of this, because the observations in the Fair and Good categories do not show an obvious relationship between the depth and price of a diamond.

Figures 6, 7, 8

Figures 6, 7, and 8 show the relationship between price and the dimensions of a diamond with respect to its cut. Figures 6 and 7 show a positive relationship between price and the length and width of a diamond, and this relationship holds for every cut. These two graphs show that greater length and width are associated with a higher price. Based on the graphs, these two attributes seem to be promising characteristics for determining the price of a diamond. However, a different story emerges for the relationship between height and price. Figure 8 shows a steep pattern: the price does not increase as height increases. Rather, most diamonds have heights that are close to one another, and prices vary widely regardless of height. Height may therefore be of little interest in our analysis of the data.

Price Prediction

Training Model

From our exploratory analysis, we know that the price of a diamond is affected by its carat, so we created a regression model to predict price. We set a random seed so that R produces the same random numbers on every run, which makes this report reproducible. We then randomly split the dataset into a training set (80%) and a test set (20%).

We notice the following:

  • Diamond price is almost linearly correlated with x, y, z, and carat; these are the critical factors driving price.
  • Diamond price seems related to cut, color, and clarity, but this is not very clear from this plot.
  • Diamond price does not seem directly related to depth and table.

Prediction Model

We fitted the model on our training dataset with the lm(y ~ x) function, where y is the outcome variable and x is the explanatory variable. Price is the outcome, and carat is the main predictor variable. Based on our domain knowledge of diamonds, we also considered taking the cube root of carat weight (a proxy for volume); the models reported in this section use carat untransformed.


Call:
lm(formula = price ~ . - index, data = Train_diamond)

Residuals:
     Min       1Q   Median       3Q      Max 
-21833.4   -686.9   -192.7    464.1  14621.2 

Coefficients:
               Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  -1.731e+05  1.449e+03 -119.479  < 2e-16 ***
carat         1.250e+04  2.522e+01  495.736  < 2e-16 ***
cutGood       5.480e+02  1.728e+01   31.711  < 2e-16 ***
cutIdeal      8.482e+02  1.713e+01   49.501  < 2e-16 ***
cutPremium    7.672e+02  1.642e+01   46.723  < 2e-16 ***
cutVery Good  7.146e+02  1.668e+01   42.839  < 2e-16 ***
colorE       -2.255e+02  9.041e+00  -24.940  < 2e-16 ***
colorF       -2.900e+02  9.153e+00  -31.688  < 2e-16 ***
colorG       -5.293e+02  8.920e+00  -59.342  < 2e-16 ***
colorH       -1.073e+03  9.466e+00 -113.354  < 2e-16 ***
colorI       -1.621e+03  1.070e+01 -151.482  < 2e-16 ***
colorJ       -2.658e+03  1.316e+01 -201.981  < 2e-16 ***
clarityIF     5.960e+03  2.599e+01  229.284  < 2e-16 ***
claritySI1    4.098e+03  2.218e+01  184.781  < 2e-16 ***
claritySI2    3.029e+03  2.226e+01  136.104  < 2e-16 ***
clarityVS1    5.121e+03  2.265e+01  226.091  < 2e-16 ***
clarityVS2    4.755e+03  2.230e+01  213.253  < 2e-16 ***
clarityVVS1   5.577e+03  2.390e+01  233.306  < 2e-16 ***
clarityVVS2   5.525e+03  2.331e+01  237.008  < 2e-16 ***
depth        -7.391e+01  2.220e+00  -33.298  < 2e-16 ***
table        -2.673e+01  1.466e+00  -18.229  < 2e-16 ***
x            -1.530e+03  3.945e+01  -38.774  < 2e-16 ***
y             4.267e+02  3.937e+01   10.837  < 2e-16 ***
z            -5.291e+01  1.347e+01   -3.927 8.62e-05 ***
year          8.713e+01  7.118e-01  122.405  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1399 on 324163 degrees of freedom
Multiple R-squared:  0.9034,    Adjusted R-squared:  0.9034 
F-statistic: 1.263e+05 on 24 and 324163 DF,  p-value: < 2.2e-16

Call:
lm(formula = price ~ carat + cut + color + clarity + depth + 
    table, data = diamonds)

Residuals:
   Min     1Q Median     3Q    Max 
-22912   -779   -183    551  14634 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -5076.394    170.838  -29.71   <2e-16 ***
carat         9888.529      5.535 1786.70   <2e-16 ***
cutGood        621.157     15.889   39.09   <2e-16 ***
cutIdeal       931.105     15.796   58.95   <2e-16 ***
cutPremium     832.807     15.240   54.65   <2e-16 ***
cutVery Good   818.129     15.235   53.70   <2e-16 ***
colorE        -227.901      8.413  -27.09   <2e-16 ***
colorF        -325.789      8.518  -38.25   <2e-16 ***
colorG        -557.659      8.301  -67.18   <2e-16 ***
colorH       -1068.107      8.810 -121.24   <2e-16 ***
colorI       -1588.028      9.950 -159.60   <2e-16 ***
colorJ       -2595.897     12.265 -211.66   <2e-16 ***
clarityIF     6036.781     24.155  249.91   <2e-16 ***
claritySI1    4005.870     20.604  194.43   <2e-16 ***
claritySI2    2953.398     20.693  142.72   <2e-16 ***
clarityVS1    5074.355     21.057  240.98   <2e-16 ***
clarityVS2    4706.052     20.732  226.99   <2e-16 ***
clarityVVS1   5645.079     22.229  253.95   <2e-16 ***
clarityVVS2   5553.129     21.679  256.15   <2e-16 ***
depth          -26.356      1.866  -14.13   <2e-16 ***
table          -23.820      1.363  -17.47   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1456 on 405211 degrees of freedom
Multiple R-squared:  0.8953,    Adjusted R-squared:  0.8953 
F-statistic: 1.733e+05 on 20 and 405211 DF,  p-value: < 2.2e-16

Residuals: The maximum residual shows that the model under-predicted the price by about $14,634 for at least one observation.

P-value: small p-values suggest that features are extremely unlikely to have no relationship to the dependent variable.

Multiple R-squared value of 0.89 suggests that the model explains approx. 89% of the variation in the dependent variable.

Given the above, the model is performing well.

Predict Price

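After fitting our model, we can use it to predict prices for new data. As a check, the correlation between the predicted and actual prices is about 0.946, computed on the same data the model was fitted to. This is a measure of fit, not a 94% confidence level.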

[1] 0.9462151

Test Prediction Model

To test our prediction model, we chose the following example attribute values:

carat = 3.5

cut = “Very Good”

color = “E”

clarity = “SI1”

depth = 65

table = 70

    1 
30749 

Prediction Model

Click Here for Team Data 4 Life Diamond Prediction Model

About Us

About Us

Please right click names in order to view LinkedIn profiles

Team Lead

Team

---
title: "Team #Data 4 Life"
output: 
  flexdashboard::flex_dashboard:
    orientation: rows
    theme: cosmo
    source: embed
    vertical_layout: fill
---
```{r setup, include=FALSE}
library(flexdashboard)
library(plotly)
library(readr)
library(rmarkdown)
library(knitr)
library(tidyverse)
library(skimr)
library(GGally)
library(corrr)
library(corrplot)
library(ggridges)
library(viridis)
library(hrbrthemes)
library(ggpubr)
library(moderndive)
library(olsrr)
library(MASS)
library(car)
library(lmtest)
library(cowplot)
library(caret)

# Load the cleaned dataset and drop any rows that still contain missing values
wholesale_diamonds_clean <- read_csv("wholesale_diamonds_clean2.csv")
wholesale_diamonds_clean2 <- na.omit(wholesale_diamonds_clean)
diamonds <- wholesale_diamonds_clean2

# Problem rows found during cleaning, shown on the Data Cleaning page
diamond_missing_carat <- read_csv("diamond_missing_carat.csv")
diamond_negative_price <- read_csv("diamonds_negative_price.csv")
```


The Diamond Challenge {data-orientation=columns data-icon="fa-diamond"}
=====================================
> The Diamond Challenge

```{r}
knitr::include_graphics("diamonds.png")
```

The Diamond Industry

Column {data-width=300}
-----------------------------------------------------------------------

### The Dataset
The wholesale diamond dataset includes numerical data and categorical data. There are 407280 total diamonds in the dataset covering sales from 2010 to 2021. For each diamond sale we are given 11 attributes:

  * carat: weight of the diamond 
  * cut: rating system quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  * color: standardized color code from J (worst) to D (best)
  * clarity:  a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  * depth: percentage (0 to 100) relating the diamond's depth (top to bottom) to its width
  * table: percentage (0 to 100) relating the diamond's overall width to the width of the top part
  * price: what the diamond sold for in US dollars
  * x: length in millimeters
  * y: width in millimeters
  * z: height in millimeters
  * year: year of the sale


Row {data-width=300}
-----------------------------------------------------------------------

### The Task
* Data Cleaning
Check the data for correctness. Remove any data entries that appear problematic. Summarize the data that was corrupted and the problems encountered. An example of a problematic data entry would be a negative sale price. These kinds of data quality problems are commonly found in real-world datasets.

* Exploratory analysis and summary statistics
Create plots, graphs, and other visuals to summarize interesting aspects of the dataset. This will give you and your company insights into the dataset.

* Building models: price prediction
The main goal of your work is to build a model to predict the price of diamond sales in 2022. That is, your model should take as input the attributes (carat, cut, color, clarity, depth, table, x, y, z, and year) and predict the price (either as a single number or a probabilistic range of values). Train a machine-learning algorithm to estimate the price of diamonds based on these attributes.

* Using your price prediction models
Congrats on building a sophisticated machine learning model to predict prices! Now it's time to deploy it.

[View Python Code](https://github.com/myhe2/Diamonds-Python-) 

Data Cleaning {data-orientation=column data-icon="fa-diamond"}
=====================================

### Cleaning the Data
Row {.tabset .tabset-fade}
-----------------------------------------------------------------------

```{r}
paged_table(diamond_missing_carat)
```
We dropped the rows with missing (null) values in the carat column.

Row {.tabset .tabset-fade}
-----------------------------------------------------------------------

```{r}
paged_table(diamond_negative_price)
```
We noticed several negative values in the cost (dollars) column. We decided to convert these to positive values, reasoning that the negative sign was likely a human error made when entering the data.

Outliers discovered in the length (mm), width (mm), height (mm), and cost (dollars) columns were also dropped.
On closer inspection, the height (mm) column contained some corrupted values that are not physically possible for a diamond; these rows were dropped as well.
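
The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration on a made-up data frame, not the team's actual cleaning script; the column names follow the attribute list, and the rule used to flag impossible dimensions (every dimension must be above 0 mm and below 30 mm) is an assumption chosen only for the example.

```r
# Toy data with the kinds of problems described above (all values are made up)
raw <- data.frame(
  carat = c(0.30, NA, 0.70, 1.00, 0.50),
  price = c(500, 800, -2100, 5200, 1500),
  x     = c(4.3, 5.1, 5.7, 6.4, 5.1),
  y     = c(4.3, 5.1, 5.7, 6.4, 5.1),
  z     = c(2.7, 3.1, 3.5, 0.0, 31.8)
)

# 1. Drop rows whose carat is missing
clean <- raw[!is.na(raw$carat), ]

# 2. Treat negative prices as sign errors and make them positive
clean$price <- abs(clean$price)

# 3. Drop rows with physically impossible dimensions
#    (assumed rule for this sketch: 0 mm < dimension < 30 mm)
ok <- with(clean, x > 0 & y > 0 & z > 0 & x < 30 & y < 30 & z < 30)
clean <- clean[ok, ]

print(clean)
```

Keeping the removed rows in separate tables, as this dashboard does for missing carat and negative price, makes it easy to summarize what was corrupted.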

> Our final Cleaned Dataset

```{r}
paged_table(wholesale_diamonds_clean2)
```

Exploratory Analysis {data-icon="fa-diamond"}
=====================================

### Exploratory Analysis

One of the most important things to do with a large data set is to plot every point. This gives us insight into how best to visualize the data. The plot below shows every data point with respect to the following variables: year, carat, and price.


Row {data-height=400 .tabset .tabset-fade .colored }
-----------------------------------------------------------------------


> Figure 1: Relationship between Carat and Price with regards to Year 

```{r}
knitr::include_graphics("figure1.png")
```

This graph shows that the data set is dominated by points from 2021, which cover up points from other years that may lie underneath them. To get a clearer view of the different categories in the data set, it is therefore better to plot a random sample of the data. Figure 1 is useful not only for deciding how to visualize the data but also for exposing flaws that the data cleaning process must address: it displays price values below 0. A negative price is not meaningful, so these entries must be handled during data cleaning. To visualize the data more clearly, plots were created from a random sample of 5,000 observations.
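
A random sample like the one described can be drawn with base R's `sample()`. The sketch below uses a simulated data frame in place of the real dataset; the object names `all_points` and `plot_sample` and the seed are assumptions for illustration.

```r
set.seed(42)  # assumed seed, only so the sketch is repeatable

# Stand-in for the full dataset: 20,000 rows of simulated values
all_points <- data.frame(
  carat = round(runif(20000, 0.2, 3), 2),
  year  = sample(2010:2021, 20000, replace = TRUE)
)

# Draw 5,000 distinct rows at random for plotting
plot_sample <- all_points[sample(nrow(all_points), 5000), ]

nrow(plot_sample)
```

Sampling without replacement (the default) keeps each diamond at most once, so the sample keeps roughly the same mix of years as the full data while reducing overplotting.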

> Figure 2: Distribution of Prices

```{r}
knitr::include_graphics("figure2.png")
```

Figure 2 shows the distribution of prices, which is right-skewed. Most of the diamonds in our data set fall within a particular price range: most are priced below $5,000. A smaller number of diamonds fall on the more expensive end of the spectrum. These high prices are the reason for the skew, because they pull the mean to the right and make the distribution positively skewed. This plot raises the question of which specific characteristics affect the value of a diamond.
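
The pull of a few high prices on the mean can be checked numerically by comparing the mean with the median. The sketch below uses simulated log-normal values as a stand-in for prices; the distribution and its parameters are assumptions and are not fitted to the real data.

```r
set.seed(1)

# Simulated right-skewed "prices" (assumed log-normal, for illustration only)
prices <- rlnorm(10000, meanlog = 7.8, sdlog = 1)

# In a right-skewed distribution the long upper tail drags the mean
# above the median
c(mean = mean(prices), median = median(prices))
mean(prices) > median(prices)
```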

> Figure 3: Relationship between Carat and Price with regards to Color

```{r}
knitr::include_graphics("figure3.png")
```

This visualization shows the relationship between carat and price for each color group. We were interested to see whether color is a major factor in the price of a diamond. According to Figure 3, the distribution of diamonds within each color group is approximately similar. In other words, there appears to be a positive correlation between a diamond's carat and its price within every color group. The plot suggests that carat influences the price of a diamond, while color may not be an impactful factor, since no color group displays a markedly different distribution from the others. However, further analysis is needed before ruling out color as a significant factor in the price of a diamond.

> Figure 4: Relationship between Depth and Price with regards to Year 

```{r}
knitr::include_graphics("figure4.png")
```

Another factor of interest was how the depth of a diamond relates to its price. We looked at the relationship between depth and price for each year. No unusual patterns distinguish one year from another; each plot shows a similar pattern. The plots also show no sign of a linear relationship. In fact, the data points appear uniformly distributed. Since each plot shows a similar pattern, there is reason to assume that the price of a diamond is affected by a specific depth range. However, we cannot say that certain depth measurements lead to high prices, because diamonds with the same depth sell at very different prices. This plot therefore suggests that depth may not be a major influence on the price of diamonds.

> Figure 5: Relationship between Depth and Price with regards to Cut

```{r}
knitr::include_graphics("figure5.png")
```

The relationship between price and depth with respect to cut was also explored. Including cut reveals a difference in the relationship between depth and price. As the plot shows, there is more variance among observations when the cut is Fair than for the other cuts. The Good category also shows more variance, but not as much as the Fair category. These differences suggest that cut may be a variable that affects the price of diamonds. However, we cannot be certain of this, because the observations in the Fair and Good categories do not show an obvious relationship between the depth and price of a diamond.

> Figures 6, 7, 8

```{r}
knitr::include_graphics(c("figure6.png", "figure7.png", "figure8.png"))
```

Figures 6, 7, and 8 show the relationship between price and the dimensions of a diamond with respect to its cut. Figures 6 and 7 show a positive relationship between price and the length and width of a diamond, and this relationship holds for every cut. These two graphs show that greater length and width are associated with a higher price. Based on the graphs, these two attributes seem to be promising characteristics for determining the price of a diamond. However, a different story emerges for the relationship between height and price. Figure 8 shows a steep pattern: the price does not increase as height increases. Rather, most diamonds have heights that are close to one another, and prices vary widely regardless of height. Height may therefore be of little interest in our analysis of the data.

Price Prediction {data-icon="fa-diamond"}
=====================================

### Training Model

From our exploratory analysis, we know that the price of a diamond is affected by its carat, so we created a regression model to predict price. We set a random seed so that R produces the same random numbers on every run, which makes this report reproducible. We then randomly split the dataset into a training set (80%) and a test set (20%).

```{r}
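# Fix the seed so the random partition is reproducible
set.seed(123)

# Random 80/20 split, balanced across the distribution of price
index_diamond <- createDataPartition(diamonds$price, p=0.8, list=FALSE)
Train_diamond <- diamonds[index_diamond,]
Test_diamond <- diamonds[-index_diamond,]

# Pairwise plots for a slice of the training rows
ggpairs(Train_diamond[4000:6000,])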
```

We notice the following:

* Diamond price is almost linearly correlated with x, y, z, and carat; these are the critical factors driving price.
* Diamond price seems related to cut, color, and clarity, but this is not very clear from this plot.
* Diamond price does not seem directly related to depth and table.

### Prediction Model

We fitted the model on our training dataset with the lm(y ~ x) function, where y is the outcome variable and x is the explanatory variable. Price is the outcome, and carat is the main predictor variable. Based on our domain knowledge of diamonds, we also considered taking the cube root of carat weight (a proxy for volume); the models reported in this section use carat untransformed.
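
One way to apply the cube-root idea is to transform carat inside the model formula with `I()`. The sketch below runs on simulated data and only illustrates the transformation; it is not one of the models reported in this dashboard, and the data frame `sim` and the name `fit_cuberoot` are assumptions.

```r
set.seed(123)

# Simulated stand-in data (not the real dataset)
sim <- data.frame(carat = runif(500, 0.2, 3))
sim$price <- 4000 * sim$carat + rnorm(500, sd = 300)

# I() protects the arithmetic so that ^ is read as a power,
# not as a formula operator
fit_cuberoot <- lm(price ~ I(carat^(1/3)), data = sim)

coef(fit_cuberoot)
```

Because the transformation lives in the formula, `predict()` applies it automatically to new data that contains a plain `carat` column.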

```{r}
# Fit on the training set using every column except the row index
model1 <- lm(price~.-index, data = Train_diamond)
summary(model1)
```

```{r}
# Fit on the full dataset using the six attributes supplied at prediction time
tr_model <- lm(price ~ carat + cut + color + clarity + depth + table, data = diamonds)
summary(tr_model)
```

Residuals: The maximum residual shows that the model under-predicted the price by about $14,634 for at least one observation.

P-value: small p-values suggest that features are extremely unlikely to have no relationship to the dependent variable.

Multiple R-squared value of 0.89 suggests that the model explains approx. 89% of the variation in the dependent variable.

Given the above, the model is performing well.
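
Because `tr_model` is fitted on the full dataset, the figures above are in-sample. A held-out check would fit on the training rows only and score the unseen test rows. The sketch below shows the idea on simulated data with base R; the data, the split, and the names `rmse` and `r2_test` are assumptions for illustration.

```r
set.seed(123)

# Simulated stand-in data (not the real dataset)
n <- 1000
sim <- data.frame(carat = runif(n, 0.2, 3), depth = rnorm(n, 61.7, 1.4))
sim$price <- 4000 * sim$carat - 20 * sim$depth + rnorm(n, sd = 500)

# 80/20 random split
train_rows <- sample(n, size = 0.8 * n)
train <- sim[train_rows, ]
test  <- sim[-train_rows, ]

# Fit on the training rows only, then predict the unseen test rows
fit  <- lm(price ~ carat + depth, data = train)
pred <- predict(fit, newdata = test)

# Out-of-sample error and R-squared
rmse    <- sqrt(mean((test$price - pred)^2))
r2_test <- 1 - sum((test$price - pred)^2) /
               sum((test$price - mean(test$price))^2)

c(rmse = rmse, r2_test = r2_test)
```

The same pattern could be applied here by fitting on `Train_diamond` and predicting on `Test_diamond`.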

### Predict Price

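After fitting our model, we can use it to predict prices for new data. As a check, the correlation between the predicted and actual prices is about 0.946, computed on the same data the model was fitted to. This is a measure of fit, not a 94% confidence level.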

```{r}
# In-sample predictions from tr_model and their correlation with actual prices
diamonds$pred <- predict(tr_model, diamonds)
cor(diamonds$pred, diamonds$price)
```
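
For a linear model with an intercept, the squared correlation between fitted and observed values equals the multiple R-squared. That is why a correlation of about 0.946 goes with an R-squared of about 0.895 (0.946² ≈ 0.895). The sketch below checks the identity on simulated data; the variable names are assumptions for illustration.

```r
set.seed(7)

# Simulated stand-in data
sim <- data.frame(x = rnorm(200))
sim$y <- 3 * sim$x + rnorm(200)

fit <- lm(y ~ x, data = sim)

# Correlation between fitted and observed values
r <- cor(fitted(fit), sim$y)

# Squared correlation matches the model's multiple R-squared
all.equal(r^2, summary(fit)$r.squared)
```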


```{r}
# Predicted vs. actual prices; the dashed red line marks perfect prediction
plot(diamonds$pred, diamonds$price)
abline(a=0, b=1, col="red", lwd=3, lty=2)
```

### Test Prediction Model

To test our prediction model, we chose the following example attribute values:

carat = 3.5

cut = "Very Good"

color = "E"

clarity = "SI1"

depth = 65

table = 70

```{r}
properties <- data.frame(
                carat = 3.5,
                cut = "Very Good",
                color = "E",
                clarity = "SI1",
                depth = 65,
                table = 70
                )

print(predict(tr_model, properties))
```
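
The task also allows a probabilistic range of values instead of a single number. For an `lm` fit, `predict()` accepts `interval = "prediction"` and returns lower and upper bounds alongside the point estimate. The sketch below shows this on simulated data; the data and the names `new_diamond` and `range_95` are assumptions for illustration.

```r
set.seed(99)

# Simulated stand-in data (not the real dataset)
sim <- data.frame(carat = runif(300, 0.2, 3))
sim$price <- 4000 * sim$carat + rnorm(300, sd = 400)

fit <- lm(price ~ carat, data = sim)

# Point estimate plus a 95% prediction interval for one new diamond
new_diamond <- data.frame(carat = 1.5)
range_95 <- predict(fit, newdata = new_diamond,
                    interval = "prediction", level = 0.95)

range_95  # columns: fit (estimate), lwr and upr (interval bounds)
```

A prediction interval widens for inputs far from the bulk of the training data, which is worth keeping in mind for unusual attribute combinations.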


Prediction Model {data-icon="fa-diamond"}
=====================================

[Click Here for Team Data 4 Life Diamond Prediction Model](https://hackdiversity.shinyapps.io/The_Diamond_Challenge/)

###
```{r}
knitr::include_graphics("model.png")
```

About Us {data-icon="fa-address-card"}
=====================================

### About Us

```{r}
knitr::include_graphics("D4L.png")
```

Row {.tabset .tabset-fade}
-----------------------------------------------------------------------
Please right click names in order to view LinkedIn profiles

**Team Lead** 

* [Nehemie Joseph](https://www.linkedin.com/in/nehemiejoseph0/)

```{r}
knitr::include_graphics("Neh.jpeg")
```

**Team**

* [Ayomide Abioye](https://www.linkedin.com/in/ayomide-joseph-abioye-a78296166/)

```{r}
knitr::include_graphics("Ayo.jpeg")
```


* [Diana  Hernandez](https://www.linkedin.com/in/diana-hernandez-2524a8191/) 

```{r}
knitr::include_graphics("Diana.jpeg")
```

* [Duamell Gomez]()

```{r}
knitr::include_graphics("Duamell.jpeg")
```

* [Jordan Clark](https://www.linkedin.com/in/jordaneclark)

```{r}
knitr::include_graphics("Jordan.jpeg")
```