Introduction

This document compromises one of the data science project done for price prediction. Data set source is from public domain and can be found here.

Primary objectives are to demonstrate following knowledge :

1.Explore, understand, cleanse & reorganized data with explained reasoning (Exploratory Data Analysis).

  1. Complete visualizations to identify and exploit noticeable trends and patterns.

  2. Highlight the correlation between the dependent variable (Price) with the independent and explain how they interact with each other.

  3. Test harness & apply 5 different machine learning algorithms to our analysis,interpret the significance of our result through statistical analysis by choosing to test one specific model.

  4. Test the chosen model & define if whether or not the designed models a good fit to predict price based on the cut of our diamond data set.

Content

  1. Overview
  2. Data Dictionary
  3. Data Cleansing & Preparation
  4. Visualization & Correlation Analysis
  5. Apply algorithms & Find Accurate Model
  6. Testing The Model & Draw Conclusion

Overview

The bigger, the more is the diamond expensive, and in some sense this might be accurate. However, we all might have praised our friend’s rind or necklace seeing that water clear looking diamond, is it really a good indicator of its value ?

In this project, use case is determine how all variables (carat, depth, table) interact with its other to explain our dependent variable, the price.

At the end, we will test algorithms to decide if whether or not we can predict diamonds prices based on the cut only.

Data Dictionnary

Data Cleansing

library('lattice')
library('ggplot2')
library('caret')
library('corrplot')
## corrplot 0.92 loaded
library("PerformanceAnalytics")
## Le chargement a nécessité le package : xts
## Le chargement a nécessité le package : zoo
## 
## Attachement du package : 'zoo'
## Les objets suivants sont masqués depuis 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attachement du package : 'PerformanceAnalytics'
## L'objet suivant est masqué depuis 'package:graphics':
## 
##     legend
library('randomForest')
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attachement du package : 'randomForest'
## L'objet suivant est masqué depuis 'package:ggplot2':
## 
##     margin
data <- read.csv("diamonds.csv")
head(data)
##   X carat       cut color clarity depth table price    x    y    z
## 1 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

Define the size by proceeding to calculation & get rid of the rows we do not include in our analysis, the variables that we are unable to include and cannot be considered are independent variables.

data$size <- data$x * data$y * data$z

data$X <- NULL
data$color <- NULL
data$x <- NULL
data$y <- NULL
data$z <- NULL
data$clarity <- NULL
summary(data)
##      carat            cut                depth           table      
##  Min.   :0.2000   Length:53940       Min.   :43.00   Min.   :43.00  
##  1st Qu.:0.4000   Class :character   1st Qu.:61.00   1st Qu.:56.00  
##  Median :0.7000   Mode  :character   Median :61.80   Median :57.00  
##  Mean   :0.7979                      Mean   :61.75   Mean   :57.46  
##  3rd Qu.:1.0400                      3rd Qu.:62.50   3rd Qu.:59.00  
##  Max.   :5.0100                      Max.   :79.00   Max.   :95.00  
##      price            size        
##  Min.   :  326   Min.   :   0.00  
##  1st Qu.:  950   1st Qu.:  65.14  
##  Median : 2401   Median : 114.81  
##  Mean   : 3933   Mean   : 129.85  
##  3rd Qu.: 5324   3rd Qu.: 170.84  
##  Max.   :18823   Max.   :3840.60

Create a list of 80% of the rows in the original data set we can use for training.

validation_index <- createDataPartition(data$cut, p=0.80, list=FALSE)

Select 20% of the data for validation.

validation <- data[-validation_index,]

Use the remaining 80% of data to training and testing the models.

df <- data[validation_index,]

Take a look at the dimensions of the data set and the types of the attributes to help summarize the data before modeling it.

dim(df)
## [1] 43153     6
sapply(df,class)
##       carat         cut       depth       table       price        size 
##   "numeric" "character"   "numeric"   "numeric"   "integer"   "numeric"

Change the cut column to factor by replacing it with “diamond_cut” and bind this new column with the old data frame to create “df2”.

diamond_cut <- as.factor(df$cut)
df2 <- cbind(df, diamond_cut)
head(df2)
##   carat       cut depth table price     size diamond_cut
## 1  0.23     Ideal  61.5    55   326 38.20203       Ideal
## 3  0.23      Good  56.9    65   327 38.07688        Good
## 5  0.31      Good  63.3    58   335 51.91725        Good
## 6  0.24 Very Good  62.8    57   336 38.69395   Very Good
## 7  0.24 Very Good  62.3    57   336 38.83087   Very Good
## 8  0.26 Very Good  61.9    55   337 42.32108   Very Good
df2$cut <- NULL

To prepare the models, let’s look at the distribution of the different levels of cut of the diamond data set.

percentage <- prop.table(table(df2$diamond_cut)) * 100
cbind(freq=table(df2$diamond_cut), percentage=percentage)
##            freq percentage
## Fair       1288   2.984729
## Good       3925   9.095544
## Ideal     17241  39.953190
## Premium   11033  25.567168
## Very Good  9666  22.399370

Finally summarize it.

summary(df2)
##      carat            depth           table           price      
##  Min.   :0.2000   Min.   :43.00   Min.   :43.00   Min.   :  326  
##  1st Qu.:0.4000   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  954  
##  Median :0.7000   Median :61.80   Median :57.00   Median : 2409  
##  Mean   :0.7999   Mean   :61.75   Mean   :57.46   Mean   : 3950  
##  3rd Qu.:1.0400   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5351  
##  Max.   :5.0100   Max.   :78.20   Max.   :95.00   Max.   :18823  
##       size           diamond_cut   
##  Min.   :   0.0   Fair     : 1288  
##  1st Qu.:  65.3   Good     : 3925  
##  Median : 114.9   Ideal    :17241  
##  Mean   : 130.2   Premium  :11033  
##  3rd Qu.: 171.1   Very Good: 9666  
##  Max.   :3840.6

Visualisation & Correlation Analysis

Now that we have reorganized our data, we want to extend that with visualizations and finding correlations between each variables.

Plot box visualization

First, let’s look how the price of diamonds evolve for each different levels of cut as the carat increases.

ggplot(df2, aes(x=carat, y=price, fill=diamond_cut)) + 
  geom_boxplot() + geom_smooth(linetype = "dashed")+
  facet_wrap(~diamond_cut)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Interesting results. Except “fair” cut, the price of diamonds reaches its peak between 2 & 3 carats and then decreases or stabilizes. “Fair” cut diamond prices keep increasing as the carat keeps increasing to reach around $20000 at 5 carats. Even though the quality is the lowest on the grid, its price increases dramatically passed other references peak.

Recall the distribution of diamond regarding the levels of cut.

plot(percentage)

Correlation plots between variables

Now we can look at the interactions between the variables. First, let’s draw a big picture using the numerical variables are or disposal.

cor.result <- cor(df2[,c(1,2,3,4)])
corrplot(cor.result, method = 'number')

We spot a strong correlation between carat and price, the prices increase as the carats increase. The depth explains poorly the price and show an almost unexisting correlation. Table & carat have a very small positive correlation. Table and depth have a small negative correlation, we can suppose that as we move through tables, depth gradually decreases.

But here is more detailed multivariate plots visualization at 10%, 5% & 1% level of significance.

data <- df2[, 1:4] # Numerical variables
cut <- df2[, 5] # Factor variable (groups)
chart.Correlation(data, histogram = TRUE, method = "pearson")
## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

We can also see the difference in distribution of each cuts by class value. We can see the Gaussian-like distribution (bell curve) of each attribute through density plot.

ggplot(df, aes(x=price, fill = diamond_cut)) +
  geom_density(alpha=0.4) + 
  geom_vline(data=df2, aes(xintercept = mean(price)),
                                       linetype="dashed")

Two trends are interesting to highlight here. The vast majority of Ideal cut diamonds cost less that 2000 dollars meanwhile the majority of fairly cut diamonds cost over 3000 dollars. most of the cuts reach their peak price between 500 & 1000 dollars meanwhile the mean is around 4000 dollars.

Apply algorithms & Find Accurate Model

Now it is time to create some models of the data and estimate their accuracy on unseen data.

We will cover step by step :

Define the Test Harness

We will 10-fold cross validation to estimate accuracy.

This will split our data set into 10 parts, train in 9 and test on 1 and release for all combinations of train-test splits. We will also repeat the process 3 times for each algorithm with different splits of the data into 10 groups, in an effort to get a more accurate estimate.

control <- trainControl(method="cv", number=10)
metric <- "Accuracy"

Apply 5 models

We don’t know which algorithm would help us predict the price of the diamonds based on the cut. As we saw earlier, the results are confusing, we are therefore not sure if a good model is on the list. It is, however, worth the try.

Our 5 models are :

  • Linear Discriminant Analysis (LDA)
  • Classification and Regression Trees (CART)
  • k-Nearest Neighbors (kNN)
  • Support Vector Machines (SVM) with a linear kernel
  • Random Forest (RF)
# Linear algorithms
set.seed(7)
fit.lda <- train(diamond_cut~., data=df2, method="lda", metric=metric, trControl=control)

# Nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(diamond_cut~., data=df2, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(diamond_cut~., data=df2, method="knn", metric=metric, trControl=control)

# Advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(diamond_cut~., data=df2, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(diamond_cut~., data=df2, method="rf", metric=metric, trControl=control)

Select Best Model

We need to compare the accuracy of each model by summarizing the results.

results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: lda, cart, knn, svm, rf 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lda  0.5959666 0.5974620 0.6006256 0.6025764 0.6058622 0.6193234    0
## cart 0.6179879 0.6303806 0.6324453 0.6388422 0.6490730 0.6621872    0
## knn  0.4939731 0.4975672 0.5018536 0.5023055 0.5071842 0.5097312    0
## svm  0.7105446 0.7148484 0.7217845 0.7207373 0.7258457 0.7300116    0
## rf   0.7079954 0.7132762 0.7150139 0.7154077 0.7168434 0.7270621    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lda  0.4056660 0.4087296 0.4115186 0.4152333 0.4189034 0.4420640    0
## cart 0.4396649 0.4584129 0.4653805 0.4751598 0.4951663 0.5155481    0
## knn  0.2659795 0.2763016 0.2799776 0.2810824 0.2878024 0.2923562    0
## svm  0.5852015 0.5918072 0.6027307 0.6005706 0.6076233 0.6141957    0
## rf   0.5862947 0.5938564 0.5960904 0.5970346 0.5994481 0.6134101    0

We can also draw dot plot to identify that our most accurate model in svm although it doesn’t seem perfectly adapted, it is the best among the five.

dotplot(results)

print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 43153 samples
##     5 predictor
##     5 classes: 'Fair', 'Good', 'Ideal', 'Premium', 'Very Good' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 38837, 38839, 38839, 38838, 38837, 38837, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.7164038  0.5942208
##   0.50  0.7194859  0.5987180
##   1.00  0.7207373  0.6005706
## 
## Tuning parameter 'sigma' was held constant at a value of 0.4528849
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.4528849 and C = 1.

Testing The Model & Draw Conclusion

The SVM was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.

We can run a confusion matrix to apply the model to our data set and summarize the result to highlight its accuracy.

predictions <- predict(fit.svm, validation)

We have to convert predictions and our diamond_cut variable into factor.

dia <-as.factor(diamond_cut)
pred <-as.factor(predictions)

I got some errors saying the freshly factored variables we not the same length which was quite disappointing because it significantly reduces the size of the sample & therefore, the accuracy of the model.

length(dia)
## [1] 43153
length(pred)
## [1] 10787

So we need to use random sampling and limit the length of “dia” to the length of “pred”.

dia2 <- sample(dia, size = 10787)

Then we proceed to the confusion matrix.

confusionMatrix(pred,dia2)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Fair Good Ideal Premium Very Good
##   Fair         8   31   133      67        67
##   Good        19   63   298     202       169
##   Ideal      143  435  1985    1285      1124
##   Premium    120  376  1475     884       785
##   Very Good   41   97   458     292       230
## 
## Overall Statistics
##                                           
##                Accuracy : 0.2939          
##                  95% CI : (0.2853, 0.3026)
##     No Information Rate : 0.4032          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0108         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: Fair Class: Good Class: Ideal Class: Premium
## Sensitivity            0.0241692     0.06287       0.4564        0.32381
## Specificity            0.9714996     0.92969       0.5360        0.65794
## Pos Pred Value         0.0261438     0.08389       0.3992        0.24286
## Neg Pred Value         0.9691823     0.90644       0.5935        0.74171
## Prevalence             0.0306851     0.09289       0.4032        0.25308
## Detection Rate         0.0007416     0.00584       0.1840        0.08195
## Detection Prevalence   0.0283675     0.06962       0.4609        0.33744
## Balanced Accuracy      0.4978344     0.49628       0.4962        0.49087
##                      Class: Very Good
## Sensitivity                   0.09684
## Specificity                   0.89444
## Pos Pred Value                0.20572
## Neg Pred Value                0.77816
## Prevalence                    0.22017
## Detection Rate                0.02132
## Detection Prevalence          0.10364
## Balanced Accuracy             0.49564

Our chosen model has an accuracy of 30.06% to predict the price of a diamond based on the cut, which makes it unreliable. This model would be reliable if it’s accuracy was around an expected margin of 97% +/-4%.

I also tried to replace the missing values of “pred” to reach the sample size of “dia”,but again the results are not proving that “SVM” is a good model to predict the value of diamond based on cut, with an accuracy of 30.29%.

pred2 <- sample(pred, size = 43153, replace = TRUE)
confusionMatrix(pred2,dia)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Fair Good Ideal Premium Very Good
##   Fair        37  107   455     302       284
##   Good        92  287  1279     750       642
##   Ideal      616 1811  7945    5090      4396
##   Premium    411 1295  5789    3773      3294
##   Very Good  132  425  1773    1118      1050
## 
## Overall Statistics
##                                          
##                Accuracy : 0.3034         
##                  95% CI : (0.299, 0.3077)
##     No Information Rate : 0.3995         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0038         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
## 
## Statistics by Class:
## 
##                      Class: Fair Class: Good Class: Ideal Class: Premium
## Sensitivity            0.0287267    0.073121       0.4608        0.34197
## Specificity            0.9725785    0.929566       0.5403        0.66410
## Pos Pred Value         0.0312236    0.094098       0.4001        0.25910
## Neg Pred Value         0.9701916    0.909284       0.6009        0.74607
## Prevalence             0.0298473    0.090955       0.3995        0.25567
## Detection Rate         0.0008574    0.006651       0.1841        0.08743
## Detection Prevalence   0.0274604    0.070679       0.4602        0.33745
## Balanced Accuracy      0.5006526    0.501343       0.5005        0.50304
##                      Class: Very Good
## Sensitivity                   0.10863
## Specificity                   0.89703
## Pos Pred Value                0.23344
## Neg Pred Value                0.77711
## Prevalence                    0.22399
## Detection Rate                0.02433
## Detection Prevalence          0.10423
## Balanced Accuracy             0.50283