R Markdown About the Diamonds Dataset

This is an R Markdown document about the diamonds dataset. We first take a quick look at it, then visualize it and apply some machine learning algorithms for practice. I truly hope this can serve as a tutorial on a data science workflow.

1. Take a quick look at this dataset

| carat | cut       | color | clarity | depth | table | price | x    | y    | z    |
|-------|-----------|-------|---------|-------|-------|-------|------|------|------|
| 0.35  | Premium   | E     | VS2     | 59.5  | 58    | 767   | 4.59 | 4.62 | 2.74 |
| 1.24  | Very Good | I     | VS2     | 60.1  | 59    | 5783  | 6.95 | 7.02 | 4.20 |
| 0.30  | Ideal     | E     | VVS2    | 61.0  | 56    | 812   | 4.33 | 4.39 | 2.66 |
| 0.55  | Very Good | G     | VS2     | 62.3  | 56    | 1551  | 5.24 | 5.29 | 3.28 |
| 1.32  | Ideal     | F     | VS2     | 62.3  | 57    | 7983  | 7.06 | 6.97 | 4.37 |
| 1.00  | Good      | H     | VS1     | 61.8  | 61    | 5219  | 6.40 | 6.31 | 3.93 |
| 0.81  | Premium   | F     | SI1     | 61.2  | 56    | 2926  | 6.03 | 6.00 | 3.68 |
| 1.00  | Ideal     | H     | SI2     | 62.0  | 56    | 4649  | 6.38 | 6.43 | 3.97 |
| 1.27  | Ideal     | F     | SI2     | 62.0  | 55    | 5405  | 6.95 | 6.90 | 4.30 |
| 0.23  | Very Good | E     | VVS2    | 61.1  | 59    | 505   | 3.94 | 3.98 | 2.42 |
| 0.23  | Very Good | E     | VVS1    | 62.3  | 58    | 530   | 3.88 | 3.99 | 2.45 |
| 0.30  | Very Good | F     | VVS2    | 62.4  | 58    | 737   | 4.25 | 4.28 | 2.66 |
| 0.50  | Ideal     | D     | SI1     | 61.7  | 56    | 1623  | 5.16 | 5.11 | 3.17 |
| 1.00  | Very Good | E     | SI2     | 63.5  | 56    | 4704  | 6.38 | 6.31 | 4.03 |
| 1.50  | Premium   | G     | SI2     | 61.8  | 60    | 6224  | 7.28 | 7.25 | 4.49 |
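
A minimal sketch of how such a random sample might be drawn (the seed below is arbitrary, so it will not reproduce the exact rows above):

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Draw 15 random rows to get a feel for the columns and their value ranges
set.seed(1)
diamonds %>% sample_n(15)
```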

1.1 Let’s see the distributions of carat and price
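
A minimal sketch of how the two distributions could be drawn with ggplot2 (the bin widths and colors are illustrative choices, not the original settings):

```r
# Histograms of carat and price
p_carat <- ggplot(diamonds, aes(carat)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue")
p_price <- ggplot(diamonds, aes(price)) +
  geom_histogram(binwidth = 250, fill = "darkorange")
gridExtra::grid.arrange(p_carat, p_price)
```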

1.2 The relationship between carat and price

It is no surprise that there is a positive trend between these two features: the bigger the carat, the more valuable the diamond, which is common sense.
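
One way to draw this trend (a sketch; the alpha blending is there because the dataset has about 54,000 rows):

```r
# Carat vs. price with a smoothed trend line; translucent points reduce overplotting
ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(se = FALSE) +
  labs(title = "Carat vs. price")
```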

1.3 The relationship between x, y, z and price

x, y and z represent the 3D measurements of a diamond. I don’t know how to judge whether a bigger size is better, or what the potential relation between x, y, z and the price is, so we use plotly to find out.
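
A minimal plotly sketch for this (the marker size and opacity are illustrative):

```r
library(plotly)

# Interactive 3D scatter of the physical dimensions, with price mapped to color
plot_ly(diamonds, x = ~x, y = ~y, z = ~z, color = ~price,
        type = "scatter3d", mode = "markers",
        marker = list(size = 2, opacity = 0.4))
```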

Well, it seems like price has nothing to do with z.

1.4 Heatmap to show inner relationships

Questions:

  • Can any mathematical method be applied to these categorical variables to extract the most significant combinations?
  • Is there any other raw data that could be added to help identify the factors most strongly related to price?
  • Currently, I really don’t know how to make a heatmap with the plotly package.
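
For the last question, one possible approach (a sketch, not the code behind the figure above) is to pass a correlation matrix of the numeric columns straight to plotly's heatmap trace:

```r
# Correlation matrix of the numeric columns rendered as a plotly heatmap
num_vars <- dplyr::select_if(diamonds, is.numeric)
cor_mat  <- cor(num_vars)
plotly::plot_ly(x = colnames(cor_mat), y = rownames(cor_mat),
                z = cor_mat, type = "heatmap")
```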

1.5 See how the data is distributed across the factors

In this plot, the vertical sizes of the blocks and the widths of the stripes (called “alluvia”) are proportional to the frequency. Here we see nicely how the high-price vs. low-price groups fan out into the different categories.
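
A sketch of how such an alluvial plot can be built with the ggalluvial package (the median split defining high_price is an assumption; the original may use a different threshold):

```r
library(ggalluvial)

# Binary high/low price flag (median split is an illustrative assumption)
alluvial_data <- diamonds %>%
  mutate(high_price = price > median(price)) %>%
  count(high_price, cut, clarity, color)

# Block heights and stripe widths are proportional to the counts in n
ggplot(alluvial_data,
       aes(axis1 = high_price, axis2 = cut, axis3 = clarity, axis4 = color, y = n)) +
  geom_alluvium(aes(fill = high_price)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 2.5) +
  theme_minimal()
```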

We find that when high_price == TRUE:

  • In cut, Ideal and Premium are highly associated with high prices.
  • In clarity, VS1, VS2, SI1 and SI2 cover all the TRUE frequencies.
  • In color, H, G and F really matter.

1.6 Inner correlations
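
The conclusion mentions the corrplot package, so the figure here was presumably produced by something like this sketch (the styling arguments are assumptions):

```r
library(corrplot)

# Correlation image over the numeric columns
num_vars <- dplyr::select_if(diamonds, is.numeric)
corrplot(cor(num_vars), method = "color", type = "upper", addCoef.col = "black")
```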

2. Machine learning algorithms applied to the diamonds data

This time we choose randomForest (https://cran.r-project.org/web/packages/randomForest/index.html) and xgboost (http://xgboost.readthedocs.io/en/latest/) to build models. Both are based on decision trees, but each uses a different optimization method to fit the data.

2.1 Split data into train and test

```r
# Randomly assign each row to train (~70%) or test (~30%)
set.seed(2018)  # added for reproducibility; the seed value is arbitrary
ind <- sample(2, nrow(diamonds), replace = TRUE, prob = c(0.7, 0.3))
train.set <- diamonds[ind == 1, ]
test.set  <- diamonds[ind == 2, ]
```

2.2 Build ML models

2.2.1 randomForest

```r
library(randomForest)

# Fit a regression forest; importance = TRUE records variable importance
rf.model <- randomForest(price ~ ., data = train.set,
                         ntree = 500, importance = TRUE)
plot(rf.model, main = "Training error vs. number of trees")
```

  1. From the plot above, we find that the training error stops decreasing once the number of trees reaches about 200. Next we visualize the feature importance in terms of MSE increase and node purity.

```r
# Collect both importance measures into a tidy tibble
imp <- tibble(name = rownames(rf.model$importance),
              MSE = rf.model$importance[, 1],
              nodePurity = rf.model$importance[, 2])
p1 <- imp %>% ggplot(aes(x = name, y = MSE, fill = name)) +
  geom_bar(stat = "identity") + theme(legend.position = "none")
p2 <- imp %>% ggplot(aes(x = name, y = nodePurity, fill = name)) +
  geom_bar(stat = "identity") + theme(legend.position = "none")
gridExtra::grid.arrange(p1, p2)
```

  2. Let’s use this model to predict on the test set and compute the residuals.

```r
rf.predict <- predict(rf.model, test.set)
# Residuals: actual price minus predicted price
plot(test.set$price - rf.predict)

# Root mean squared error on the test set (this is an RMSE, not an MSE)
rmse.rf <- sqrt(mean((test.set$price - rf.predict)^2))
```

The RMSE of the random forest model on the test set is 552.9567994.

2.2.2 XGBoost

2.2.2.1 Converting categorical variables into numeric ones. Because xgboost can only be applied to all-numeric data, we use the base model.matrix function to build one-hot indicator columns (the Matrix package’s sparse.model.matrix is the sparse alternative).

```r
library(xgboost)

# One-hot encode cut, color and clarity, then bind them to the numeric columns
xgb.train <- bind_cols(select_if(train.set, is.numeric),
                       model.matrix(~ cut - 1, train.set) %>% as_tibble(),
                       model.matrix(~ color - 1, train.set) %>% as_tibble(),
                       model.matrix(~ clarity - 1, train.set) %>% as_tibble())
# xgb.DMatrix is xgboost's own data structure; price is the label
xgboost.train <- xgb.DMatrix(data = as.matrix(select(xgb.train, -price)),
                             label = xgb.train$price)

xgb.test <- bind_cols(select_if(test.set, is.numeric),
                      model.matrix(~ cut - 1, test.set) %>% as_tibble(),
                      model.matrix(~ color - 1, test.set) %>% as_tibble(),
                      model.matrix(~ clarity - 1, test.set) %>% as_tibble())
xgboost.test <- xgb.DMatrix(data = as.matrix(select(xgb.test, -price)),
                            label = xgb.test$price)
```

The code above was quite a data-cleaning process. To get your data ready for xgboost, this is all you need to do:

  • Remove the target variable from the training features
  • Convert categorical information (like cut, color and clarity here) to a numeric format
  • Convert the cleaned data frame to a DMatrix

2.2.2.2 Let’s build an xgboost model

```r
# gamma and max_depth regularize the trees; nthread sets CPU parallelism
param <- list(eval_metric = "rmse", gamma = 1, max_depth = 6, nthread = 3)
# Train for up to 500 rounds, stopping early if the test RMSE
# has not improved within 60 rounds
xg.model <- xgb.train(data = xgboost.train, params = param,
                      watchlist = list(test = xgboost.test),
                      nrounds = 500, early_stopping_rounds = 60,
                      print_every_n = 30)
```
## [1]  test-rmse:3976.106689 
## Will train until test_rmse hasn't improved in 60 rounds.
## 
## [31] test-rmse:574.323853 
## [61] test-rmse:564.800171 
## [91] test-rmse:564.286743 
## [121]    test-rmse:561.731018 
## [151]    test-rmse:563.636597 
## Stopping. Best iteration:
## [114]    test-rmse:561.354980
```r
xg.predict <- predict(xg.model, xgboost.test)
# Root mean squared error on the test set
rmse.xgb <- sqrt(mean((test.set$price - xg.predict)^2))
# Residuals: actual price minus predicted price
plot(test.set$price - xg.predict)
```

The RMSE of the xgboost model on the test set is 561.3550006.
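
Putting the two test-set RMSEs side by side (using the values computed above):

```r
# Compare the two models' test-set RMSE
tibble(model = c("randomForest", "xgboost"),
       rmse  = c(rmse.rf, rmse.xgb))
```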

3. Conclusion


And that is all there is to it! In this data mining tutorial, I have shown you:

  1. How to take a quick look at a dataset
  2. How to use ggplot2 for data visualization and dplyr for data manipulation
  3. How to use the corrplot package to plot a correlation image
  4. How to train models with randomForest and XGBoost

I hope you enjoyed this markdown!