R Markdown About the Diamonds Dataset

This is an R Markdown document about the diamonds dataset. We first take a quick look at it, then visualize it and apply some machine learning algorithms for practice. I truly hope this can serve as a tutorial on a data science workflow.

1. Take a quick look at this dataset

| carat | cut       | color | clarity | depth | table | price | x    | y    | z    |
|-------|-----------|-------|---------|-------|-------|-------|------|------|------|
| 0.35  | Premium   | E     | VS2     | 59.5  | 58    | 767   | 4.59 | 4.62 | 2.74 |
| 1.24  | Very Good | I     | VS2     | 60.1  | 59    | 5783  | 6.95 | 7.02 | 4.20 |
| 0.30  | Ideal     | E     | VVS2    | 61.0  | 56    | 812   | 4.33 | 4.39 | 2.66 |
| 0.55  | Very Good | G     | VS2     | 62.3  | 56    | 1551  | 5.24 | 5.29 | 3.28 |
| 1.32  | Ideal     | F     | VS2     | 62.3  | 57    | 7983  | 7.06 | 6.97 | 4.37 |
| 1.00  | Good      | H     | VS1     | 61.8  | 61    | 5219  | 6.40 | 6.31 | 3.93 |
| 0.81  | Premium   | F     | SI1     | 61.2  | 56    | 2926  | 6.03 | 6.00 | 3.68 |
| 1.00  | Ideal     | H     | SI2     | 62.0  | 56    | 4649  | 6.38 | 6.43 | 3.97 |
| 1.27  | Ideal     | F     | SI2     | 62.0  | 55    | 5405  | 6.95 | 6.90 | 4.30 |
| 0.23  | Very Good | E     | VVS2    | 61.1  | 59    | 505   | 3.94 | 3.98 | 2.42 |
| 0.23  | Very Good | E     | VVS1    | 62.3  | 58    | 530   | 3.88 | 3.99 | 2.45 |
| 0.30  | Very Good | F     | VVS2    | 62.4  | 58    | 737   | 4.25 | 4.28 | 2.66 |
| 0.50  | Ideal     | D     | SI1     | 61.7  | 56    | 1623  | 5.16 | 5.11 | 3.17 |
| 1.00  | Very Good | E     | SI2     | 63.5  | 56    | 4704  | 6.38 | 6.31 | 4.03 |
| 1.50  | Premium   | G     | SI2     | 61.8  | 60    | 6224  | 7.28 | 7.25 | 4.49 |
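
A minimal sketch of how such a random sample might be drawn (the seed below is arbitrary, so it will not reproduce the exact rows above):

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Draw 15 random rows to get a feel for the columns and their value ranges
set.seed(1)
diamonds %>% sample_n(15)
```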

1.1 Let’s see the distributions of carat and price
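
A minimal sketch of how the two distributions could be drawn with ggplot2 (the bin widths and colors are illustrative choices, not the original settings):

```r
# Histograms of carat and price
p_carat <- ggplot(diamonds, aes(carat)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue")
p_price <- ggplot(diamonds, aes(price)) +
  geom_histogram(binwidth = 250, fill = "darkorange")
gridExtra::grid.arrange(p_carat, p_price)
```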

1.2 The relationship between carat and price

It is no surprise that there is a positive trend between these two features: the bigger the carat, the more valuable the diamond, which is common sense.
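
One way to draw this trend (a sketch; the alpha blending is there because the dataset has about 54,000 rows):

```r
# Carat vs. price with a smoothed trend line; translucent points reduce overplotting
ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(se = FALSE) +
  labs(title = "Carat vs. price")
```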

1.3 The relationship between x, y, z and price

x, y and z represent the 3D measurements of a diamond. I don’t know how to judge whether a bigger size is better, or what the potential relation between x, y, z and the price is, so we use plotly to find out.
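
A minimal plotly sketch for this (the marker size and opacity are illustrative):

```r
library(plotly)

# Interactive 3D scatter of the physical dimensions, with price mapped to color
plot_ly(diamonds, x = ~x, y = ~y, z = ~z, color = ~price,
        type = "scatter3d", mode = "markers",
        marker = list(size = 2, opacity = 0.4))
```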

Well, it seems like price has nothing to do with z.

1.4 Heatmap to show inner relationships

Questions:

  • Can any mathematical method be applied to these categorical variables to extract the most significant combinations?
  • Is there any other raw data that could be added to help identify the factors most strongly related to price?
  • Currently, I really don’t know how to make a heatmap with the plotly package.
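
For the last question, one possible approach (a sketch, not the code behind the figure above) is to pass a correlation matrix of the numeric columns straight to plotly's heatmap trace:

```r
# Correlation matrix of the numeric columns rendered as a plotly heatmap
num_vars <- dplyr::select_if(diamonds, is.numeric)
cor_mat  <- cor(num_vars)
plotly::plot_ly(x = colnames(cor_mat), y = rownames(cor_mat),
                z = cor_mat, type = "heatmap")
```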

1.5 See how the data is distributed across the factors

In this plot, the vertical sizes of the blocks and the widths of the stripes (called “alluvia”) are proportional to the frequency. Here we see nicely how the high-price vs. low-price groups fan out into the different categories.
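
A sketch of how such an alluvial plot can be built with the ggalluvial package (the median split defining high_price is an assumption; the original may use a different threshold):

```r
library(ggalluvial)

# Binary high/low price flag (median split is an illustrative assumption)
alluvial_data <- diamonds %>%
  mutate(high_price = price > median(price)) %>%
  count(high_price, cut, clarity, color)

# Block heights and stripe widths are proportional to the counts in n
ggplot(alluvial_data,
       aes(axis1 = high_price, axis2 = cut, axis3 = clarity, axis4 = color, y = n)) +
  geom_alluvium(aes(fill = high_price)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 2.5) +
  theme_minimal()
```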

We find that when high_price == TRUE:

  • In cut, Ideal and Premium are highly associated with high prices.
  • In clarity, VS1, VS2, SI1 and SI2 cover all the TRUE frequencies.
  • In color, H, G and F really matter.

1.6 Inner correlations
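
The conclusion mentions the corrplot package, so the figure here was presumably produced by something like this sketch (the styling arguments are assumptions):

```r
library(corrplot)

# Correlation image over the numeric columns
num_vars <- dplyr::select_if(diamonds, is.numeric)
corrplot(cor(num_vars), method = "color", type = "upper", addCoef.col = "black")
```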

2. Machine learning algorithms applied to the diamonds data

This time we choose randomForest (https://cran.r-project.org/web/packages/randomForest/index.html) and xgboost (http://xgboost.readthedocs.io/en/latest/) to build models. Both are based on decision trees, but each uses a different optimization method to fit the data.

2.1 Split data into train and test

```r
# Randomly assign each row to train (~70%) or test (~30%)
set.seed(2018)  # added for reproducibility; the seed value is arbitrary
ind <- sample(2, nrow(diamonds), replace = TRUE, prob = c(0.7, 0.3))
train.set <- diamonds[ind == 1, ]
test.set  <- diamonds[ind == 2, ]
```

2.2 Build ML models

2.2.1 randomForest

```r
library(randomForest)

# Fit a regression forest; importance = TRUE records variable importance
rf.model <- randomForest(price ~ ., data = train.set,
                         ntree = 500, importance = TRUE)
plot(rf.model, main = "Training error vs. number of trees")
```

  1. From the plot above, we find that the training error stops decreasing once the number of trees reaches about 200. Next we visualize the feature importance in terms of MSE increase and node purity.

```r
# Collect both importance measures into a tidy tibble
imp <- tibble(name = rownames(rf.model$importance),
              MSE = rf.model$importance[, 1],
              nodePurity = rf.model$importance[, 2])
p1 <- imp %>% ggplot(aes(x = name, y = MSE, fill = name)) +
  geom_bar(stat = "identity") + theme(legend.position = "none")
p2 <- imp %>% ggplot(aes(x = name, y = nodePurity, fill = name)) +
  geom_bar(stat = "identity") + theme(legend.position = "none")
gridExtra::grid.arrange(p1, p2)
```

  2. Let’s use this model to predict on the test set and compute the residuals.

```r
rf.predict <- predict(rf.model, test.set)
# Residuals: actual price minus predicted price
plot(test.set$price - rf.predict)

# Root mean squared error on the test set (this is an RMSE, not an MSE)
rmse.rf <- sqrt(mean((test.set$price - rf.predict)^2))
```

The RMSE of the random forest model on the test set is 552.9567994.

2.2.2 XGBoost

2.2.2.1 Converting categorical variables into numeric ones. Because xgboost can only be applied to all-numeric data, we use the base model.matrix function to build one-hot indicator columns (the Matrix package’s sparse.model.matrix is the sparse alternative).

```r
library(xgboost)

# One-hot encode cut, color and clarity, then bind them to the numeric columns
xgb.train <- bind_cols(select_if(train.set, is.numeric),
                       model.matrix(~ cut - 1, train.set) %>% as_tibble(),
                       model.matrix(~ color - 1, train.set) %>% as_tibble(),
                       model.matrix(~ clarity - 1, train.set) %>% as_tibble())
# xgb.DMatrix is xgboost's own data structure; price is the label
xgboost.train <- xgb.DMatrix(data = as.matrix(select(xgb.train, -price)),
                             label = xgb.train$price)

xgb.test <- bind_cols(select_if(test.set, is.numeric),
                      model.matrix(~ cut - 1, test.set) %>% as_tibble(),
                      model.matrix(~ color - 1, test.set) %>% as_tibble(),
                      model.matrix(~ clarity - 1, test.set) %>% as_tibble())
xgboost.test <- xgb.DMatrix(data = as.matrix(select(xgb.test, -price)),
                            label = xgb.test$price)
```

The code above was quite a data-cleaning process. To get your data ready for xgboost, this is all you need to do:

  • Remove the target variable from the training features
  • Convert categorical information (like cut, color and clarity here) to a numeric format
  • Convert the cleaned data frame to a DMatrix

2.2.2.2 Let’s build an xgboost model

```r
# gamma and max_depth regularize the trees; nthread sets CPU parallelism
param <- list(eval_metric = "rmse", gamma = 1, max_depth = 6, nthread = 3)
# Train for up to 500 rounds, stopping early if the test RMSE
# has not improved within 60 rounds
xg.model <- xgb.train(data = xgboost.train, params = param,
                      watchlist = list(test = xgboost.test),
                      nrounds = 500, early_stopping_rounds = 60,
                      print_every_n = 30)
```
## [1]  test-rmse:3976.106689 
## Will train until test_rmse hasn't improved in 60 rounds.
## 
## [31] test-rmse:574.323853 
## [61] test-rmse:564.800171 
## [91] test-rmse:564.286743 
## [121]    test-rmse:561.731018 
## [151]    test-rmse:563.636597 
## Stopping. Best iteration:
## [114]    test-rmse:561.354980
```r
xg.predict <- predict(xg.model, xgboost.test)
# Root mean squared error on the test set
rmse.xgb <- sqrt(mean((test.set$price - xg.predict)^2))
# Residuals: actual price minus predicted price
plot(test.set$price - xg.predict)
```

The RMSE of the xgboost model on the test set is 561.3550006.
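
Putting the two test-set RMSEs side by side (using the values computed above):

```r
# Compare the two models' test-set RMSE
tibble(model = c("randomForest", "xgboost"),
       rmse  = c(rmse.rf, rmse.xgb))
```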

3. Conclusion


And that is all there is to it! In this data mining tutorial, I have shown you:

  1. How to take a quick look at a dataset
  2. How to use ggplot2 for data visualization and dplyr for data manipulation
  3. How to use the corrplot package to plot a correlation image
  4. How to train models with randomForest and XGBoost

I hope you enjoyed this markdown!