Content
- Overview
- Data Dictionary
- Data Cleansing & Preparation
- Visualization & Correlation Analysis
- Apply algorithms & Find Accurate Model
- Testing The Model & Draw Conclusion
This document compromises one of the data science project done for price prediction. Data set source is from public domain and can be found here.
Primary objectives are to demonstrate following knowledge :
1.Explore, understand, cleanse & reorganized data with explained reasoning (Exploratory Data Analysis).
Complete visualizations to identify and exploit noticeable trends and patterns.
Highlight the correlation between the dependent variable (Price) with the independent and explain how they interact with each other.
Test harness & apply 5 different machine learning algorithms to our analysis,interpret the significance of our result through statistical analysis by choosing to test one specific model.
Test the chosen model & define if whether or not the designed models a good fit to predict price based on the cut of our diamond data set.
The bigger, the more is the diamond expensive, and in some sense this might be accurate. However, we all might have praised our friend’s rind or necklace seeing that water clear looking diamond, is it really a good indicator of its value ?
In this project, use case is determine how all variables (carat, depth, table) interact with its other to explain our dependent variable, the price.
At the end, we will test algorithms to decide if whether or not we can predict diamonds prices based on the cut only.
library('lattice')
library('ggplot2')
library('caret')
library('corrplot')
## corrplot 0.92 loaded
library("PerformanceAnalytics")
## Le chargement a nécessité le package : xts
## Le chargement a nécessité le package : zoo
##
## Attachement du package : 'zoo'
## Les objets suivants sont masqués depuis 'package:base':
##
## as.Date, as.Date.numeric
##
## Attachement du package : 'PerformanceAnalytics'
## L'objet suivant est masqué depuis 'package:graphics':
##
## legend
library('randomForest')
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attachement du package : 'randomForest'
## L'objet suivant est masqué depuis 'package:ggplot2':
##
## margin
data <- read.csv("diamonds.csv")
head(data)
## X carat cut color clarity depth table price x y z
## 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Define the size by proceeding to calculation & get rid of the rows we do not include in our analysis, the variables that we are unable to include and cannot be considered are independent variables.
data$size <- data$x * data$y * data$z
data$X <- NULL
data$color <- NULL
data$x <- NULL
data$y <- NULL
data$z <- NULL
data$clarity <- NULL
summary(data)
## carat cut depth table
## Min. :0.2000 Length:53940 Min. :43.00 Min. :43.00
## 1st Qu.:0.4000 Class :character 1st Qu.:61.00 1st Qu.:56.00
## Median :0.7000 Mode :character Median :61.80 Median :57.00
## Mean :0.7979 Mean :61.75 Mean :57.46
## 3rd Qu.:1.0400 3rd Qu.:62.50 3rd Qu.:59.00
## Max. :5.0100 Max. :79.00 Max. :95.00
## price size
## Min. : 326 Min. : 0.00
## 1st Qu.: 950 1st Qu.: 65.14
## Median : 2401 Median : 114.81
## Mean : 3933 Mean : 129.85
## 3rd Qu.: 5324 3rd Qu.: 170.84
## Max. :18823 Max. :3840.60
Create a list of 80% of the rows in the original data set we can use for training.
validation_index <- createDataPartition(data$cut, p=0.80, list=FALSE)
Select 20% of the data for validation.
validation <- data[-validation_index,]
Use the remaining 80% of data to training and testing the models.
df <- data[validation_index,]
Take a look at the dimensions of the data set and the types of the attributes to help summarize the data before modeling it.
dim(df)
## [1] 43153 6
sapply(df,class)
## carat cut depth table price size
## "numeric" "character" "numeric" "numeric" "integer" "numeric"
Change the cut column to factor by replacing it with “diamond_cut” and bind this new column with the old data frame to create “df2”.
diamond_cut <- as.factor(df$cut)
df2 <- cbind(df, diamond_cut)
head(df2)
## carat cut depth table price size diamond_cut
## 1 0.23 Ideal 61.5 55 326 38.20203 Ideal
## 3 0.23 Good 56.9 65 327 38.07688 Good
## 5 0.31 Good 63.3 58 335 51.91725 Good
## 6 0.24 Very Good 62.8 57 336 38.69395 Very Good
## 7 0.24 Very Good 62.3 57 336 38.83087 Very Good
## 8 0.26 Very Good 61.9 55 337 42.32108 Very Good
df2$cut <- NULL
To prepare the models, let’s look at the distribution of the different levels of cut of the diamond data set.
percentage <- prop.table(table(df2$diamond_cut)) * 100
cbind(freq=table(df2$diamond_cut), percentage=percentage)
## freq percentage
## Fair 1288 2.984729
## Good 3925 9.095544
## Ideal 17241 39.953190
## Premium 11033 25.567168
## Very Good 9666 22.399370
Finally summarize it.
summary(df2)
## carat depth table price
## Min. :0.2000 Min. :43.00 Min. :43.00 Min. : 326
## 1st Qu.:0.4000 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 954
## Median :0.7000 Median :61.80 Median :57.00 Median : 2409
## Mean :0.7999 Mean :61.75 Mean :57.46 Mean : 3950
## 3rd Qu.:1.0400 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5351
## Max. :5.0100 Max. :78.20 Max. :95.00 Max. :18823
## size diamond_cut
## Min. : 0.0 Fair : 1288
## 1st Qu.: 65.3 Good : 3925
## Median : 114.9 Ideal :17241
## Mean : 130.2 Premium :11033
## 3rd Qu.: 171.1 Very Good: 9666
## Max. :3840.6
Now that we have reorganized our data, we want to extend that with visualizations and finding correlations between each variables.
First, let’s look how the price of diamonds evolve for each different levels of cut as the carat increases.
ggplot(df2, aes(x=carat, y=price, fill=diamond_cut)) +
geom_boxplot() + geom_smooth(linetype = "dashed")+
facet_wrap(~diamond_cut)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Interesting results. Except “fair” cut, the price of diamonds reaches its peak between 2 & 3 carats and then decreases or stabilizes. “Fair” cut diamond prices keep increasing as the carat keeps increasing to reach around $20000 at 5 carats. Even though the quality is the lowest on the grid, its price increases dramatically passed other references peak.
Recall the distribution of diamond regarding the levels of cut.
plot(percentage)
Now we can look at the interactions between the variables. First, let’s draw a big picture using the numerical variables are or disposal.
cor.result <- cor(df2[,c(1,2,3,4)])
corrplot(cor.result, method = 'number')
We spot a strong correlation between carat and price, the prices
increase as the carats increase. The depth explains poorly the price and
show an almost unexisting correlation. Table & carat have a very
small positive correlation. Table and depth have a small negative
correlation, we can suppose that as we move through tables, depth
gradually decreases.
But here is more detailed multivariate plots visualization at 10%, 5% & 1% level of significance.
data <- df2[, 1:4] # Numerical variables
cut <- df2[, 5] # Factor variable (groups)
chart.Correlation(data, histogram = TRUE, method = "pearson")
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
We can also see the difference in distribution of each cuts by class value. We can see the Gaussian-like distribution (bell curve) of each attribute through density plot.
ggplot(df, aes(x=price, fill = diamond_cut)) +
geom_density(alpha=0.4) +
geom_vline(data=df2, aes(xintercept = mean(price)),
linetype="dashed")
Two trends are interesting to highlight here. The vast majority of Ideal cut diamonds cost less that 2000 dollars meanwhile the majority of fairly cut diamonds cost over 3000 dollars. most of the cuts reach their peak price between 500 & 1000 dollars meanwhile the mean is around 4000 dollars.
Now it is time to create some models of the data and estimate their accuracy on unseen data.
We will cover step by step :
We will 10-fold cross validation to estimate accuracy.
This will split our data set into 10 parts, train in 9 and test on 1 and release for all combinations of train-test splits. We will also repeat the process 3 times for each algorithm with different splits of the data into 10 groups, in an effort to get a more accurate estimate.
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
We don’t know which algorithm would help us predict the price of the diamonds based on the cut. As we saw earlier, the results are confusing, we are therefore not sure if a good model is on the list. It is, however, worth the try.
Our 5 models are :
# Linear algorithms
set.seed(7)
fit.lda <- train(diamond_cut~., data=df2, method="lda", metric=metric, trControl=control)
# Nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(diamond_cut~., data=df2, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(diamond_cut~., data=df2, method="knn", metric=metric, trControl=control)
# Advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(diamond_cut~., data=df2, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(diamond_cut~., data=df2, method="rf", metric=metric, trControl=control)
We need to compare the accuracy of each model by summarizing the results.
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, cart, knn, svm, rf
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.5959666 0.5974620 0.6006256 0.6025764 0.6058622 0.6193234 0
## cart 0.6179879 0.6303806 0.6324453 0.6388422 0.6490730 0.6621872 0
## knn 0.4939731 0.4975672 0.5018536 0.5023055 0.5071842 0.5097312 0
## svm 0.7105446 0.7148484 0.7217845 0.7207373 0.7258457 0.7300116 0
## rf 0.7079954 0.7132762 0.7150139 0.7154077 0.7168434 0.7270621 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.4056660 0.4087296 0.4115186 0.4152333 0.4189034 0.4420640 0
## cart 0.4396649 0.4584129 0.4653805 0.4751598 0.4951663 0.5155481 0
## knn 0.2659795 0.2763016 0.2799776 0.2810824 0.2878024 0.2923562 0
## svm 0.5852015 0.5918072 0.6027307 0.6005706 0.6076233 0.6141957 0
## rf 0.5862947 0.5938564 0.5960904 0.5970346 0.5994481 0.6134101 0
We can also draw dot plot to identify that our most accurate model in svm although it doesn’t seem perfectly adapted, it is the best among the five.
dotplot(results)
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel
##
## 43153 samples
## 5 predictor
## 5 classes: 'Fair', 'Good', 'Ideal', 'Premium', 'Very Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 38837, 38839, 38839, 38838, 38837, 38837, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.7164038 0.5942208
## 0.50 0.7194859 0.5987180
## 1.00 0.7207373 0.6005706
##
## Tuning parameter 'sigma' was held constant at a value of 0.4528849
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.4528849 and C = 1.
The SVM was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.
We can run a confusion matrix to apply the model to our data set and summarize the result to highlight its accuracy.
predictions <- predict(fit.svm, validation)
We have to convert predictions and our diamond_cut variable into factor.
dia <-as.factor(diamond_cut)
pred <-as.factor(predictions)
I got some errors saying the freshly factored variables we not the same length which was quite disappointing because it significantly reduces the size of the sample & therefore, the accuracy of the model.
length(dia)
## [1] 43153
length(pred)
## [1] 10787
So we need to use random sampling and limit the length of “dia” to the length of “pred”.
dia2 <- sample(dia, size = 10787)
Then we proceed to the confusion matrix.
confusionMatrix(pred,dia2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fair Good Ideal Premium Very Good
## Fair 8 31 133 67 67
## Good 19 63 298 202 169
## Ideal 143 435 1985 1285 1124
## Premium 120 376 1475 884 785
## Very Good 41 97 458 292 230
##
## Overall Statistics
##
## Accuracy : 0.2939
## 95% CI : (0.2853, 0.3026)
## No Information Rate : 0.4032
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0108
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: Fair Class: Good Class: Ideal Class: Premium
## Sensitivity 0.0241692 0.06287 0.4564 0.32381
## Specificity 0.9714996 0.92969 0.5360 0.65794
## Pos Pred Value 0.0261438 0.08389 0.3992 0.24286
## Neg Pred Value 0.9691823 0.90644 0.5935 0.74171
## Prevalence 0.0306851 0.09289 0.4032 0.25308
## Detection Rate 0.0007416 0.00584 0.1840 0.08195
## Detection Prevalence 0.0283675 0.06962 0.4609 0.33744
## Balanced Accuracy 0.4978344 0.49628 0.4962 0.49087
## Class: Very Good
## Sensitivity 0.09684
## Specificity 0.89444
## Pos Pred Value 0.20572
## Neg Pred Value 0.77816
## Prevalence 0.22017
## Detection Rate 0.02132
## Detection Prevalence 0.10364
## Balanced Accuracy 0.49564
Our chosen model has an accuracy of 30.06% to predict the price of a diamond based on the cut, which makes it unreliable. This model would be reliable if it’s accuracy was around an expected margin of 97% +/-4%.
I also tried to replace the missing values of “pred” to reach the sample size of “dia”,but again the results are not proving that “SVM” is a good model to predict the value of diamond based on cut, with an accuracy of 30.29%.
pred2 <- sample(pred, size = 43153, replace = TRUE)
confusionMatrix(pred2,dia)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fair Good Ideal Premium Very Good
## Fair 37 107 455 302 284
## Good 92 287 1279 750 642
## Ideal 616 1811 7945 5090 4396
## Premium 411 1295 5789 3773 3294
## Very Good 132 425 1773 1118 1050
##
## Overall Statistics
##
## Accuracy : 0.3034
## 95% CI : (0.299, 0.3077)
## No Information Rate : 0.3995
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0038
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: Fair Class: Good Class: Ideal Class: Premium
## Sensitivity 0.0287267 0.073121 0.4608 0.34197
## Specificity 0.9725785 0.929566 0.5403 0.66410
## Pos Pred Value 0.0312236 0.094098 0.4001 0.25910
## Neg Pred Value 0.9701916 0.909284 0.6009 0.74607
## Prevalence 0.0298473 0.090955 0.3995 0.25567
## Detection Rate 0.0008574 0.006651 0.1841 0.08743
## Detection Prevalence 0.0274604 0.070679 0.4602 0.33745
## Balanced Accuracy 0.5006526 0.501343 0.5005 0.50304
## Class: Very Good
## Sensitivity 0.10863
## Specificity 0.89703
## Pos Pred Value 0.23344
## Neg Pred Value 0.77711
## Prevalence 0.22399
## Detection Rate 0.02433
## Detection Prevalence 0.10423
## Balanced Accuracy 0.50283