This is an R Markdown document exploring the diamonds dataset. We first take a glimpse of the data, then visualize it and apply some machine learning algorithms for practice. I truly hope this can serve as a tutorial on a data science workflow.
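As a quick first look, a minimal sketch like the following could print the column types and draw a random handful of rows similar to the sample table below (the use of `sample_n` and the seed value are my own assumptions, not necessarily how the table was produced):

library(tidyverse)

# Inspect column types and the first few values of each variable
glimpse(diamonds)

# Draw a small random sample of rows (seed chosen only for reproducibility)
set.seed(1)
sample_n(diamonds, 15) %>% knitr::kable()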
| carat | cut | color | clarity | depth (%) | table (%) | price (USD) | x (mm) | y (mm) | z (mm) |
|---|---|---|---|---|---|---|---|---|---|
| 0.35 | Premium | E | VS2 | 59.5 | 58 | 767 | 4.59 | 4.62 | 2.74 |
| 1.24 | Very Good | I | VS2 | 60.1 | 59 | 5783 | 6.95 | 7.02 | 4.20 |
| 0.30 | Ideal | E | VVS2 | 61.0 | 56 | 812 | 4.33 | 4.39 | 2.66 |
| 0.55 | Very Good | G | VS2 | 62.3 | 56 | 1551 | 5.24 | 5.29 | 3.28 |
| 1.32 | Ideal | F | VS2 | 62.3 | 57 | 7983 | 7.06 | 6.97 | 4.37 |
| 1.00 | Good | H | VS1 | 61.8 | 61 | 5219 | 6.40 | 6.31 | 3.93 |
| 0.81 | Premium | F | SI1 | 61.2 | 56 | 2926 | 6.03 | 6.00 | 3.68 |
| 1.00 | Ideal | H | SI2 | 62.0 | 56 | 4649 | 6.38 | 6.43 | 3.97 |
| 1.27 | Ideal | F | SI2 | 62.0 | 55 | 5405 | 6.95 | 6.90 | 4.30 |
| 0.23 | Very Good | E | VVS2 | 61.1 | 59 | 505 | 3.94 | 3.98 | 2.42 |
| 0.23 | Very Good | E | VVS1 | 62.3 | 58 | 530 | 3.88 | 3.99 | 2.45 |
| 0.30 | Very Good | F | VVS2 | 62.4 | 58 | 737 | 4.25 | 4.28 | 2.66 |
| 0.50 | Ideal | D | SI1 | 61.7 | 56 | 1623 | 5.16 | 5.11 | 3.17 |
| 1.00 | Very Good | E | SI2 | 63.5 | 56 | 4704 | 6.38 | 6.31 | 4.03 |
| 1.50 | Premium | G | SI2 | 61.8 | 60 | 6224 | 7.28 | 7.25 | 4.49 |
It is no surprise that there is a positive trend between those two features: the higher the carat, the more valuable the diamond, which is common sense for everyone.
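A simple scatter plot makes that trend visible; this is only a minimal sketch (assuming ggplot2 is loaded), not necessarily the exact plot shown above:

# Price against carat, with a smoothing line to highlight the positive trend
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  geom_smooth() +
  labs(title = "Price increases with carat")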
x, y, and z represent the 3D measurements of a diamond. I don't know how to judge which size is better, or the potential relation between x, y, z and the price, so we use plotly to find out.
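One way to explore this with plotly is a 3D scatter plot of the dimensions, colored by price. This is only a sketch; subsampling the data is my own assumption, just to keep the interactive plot responsive:

library(plotly)

# Subsample to keep the interactive plot light (assumption, not required)
diamonds_small <- dplyr::sample_n(diamonds, 2000)

# 3D scatter of the physical dimensions, colored by price
plot_ly(diamonds_small,
        x = ~x, y = ~y, z = ~z,
        color = ~price,
        type = "scatter3d", mode = "markers",
        marker = list(size = 2))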
Well, it seems like price has little to do with z.

Questions:

- Can any mathematical method be applied to those categorical variables to extract the most significant combinations?
- Is there any other raw data that could be added to help understand the factors highly related to the price?
- Currently, I really don't know how to make a heatmap with the plotly package.
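On the last question, one possible approach (my own addition, not part of the original analysis) is to draw a heatmap of the correlations among the numeric columns with plotly's heatmap trace:

# Correlation matrix of the numeric columns
num_cols <- dplyr::select_if(diamonds, is.numeric)
cor_mat  <- cor(num_cols)

# Heatmap of the correlation matrix with plotly
plot_ly(x = colnames(cor_mat), y = rownames(cor_mat),
        z = cor_mat, type = "heatmap")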
In this plot, the vertical sizes of the blocks and the widths of the stripes (called "alluvia") are proportional to the frequency. Here, we see nicely how the high-price vs low-price groups fan out into the different categories.
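An alluvial plot like the one described above could be built with the ggalluvial package; the sketch below is my own reconstruction, and the `high_price` cutoff at the median price is an assumption:

library(ggalluvial)

# Flag high-price diamonds (median cutoff is an assumption) and count the combinations
diamonds %>%
  mutate(high_price = price > median(price)) %>%
  count(high_price, cut, color, clarity) %>%
  ggplot(aes(axis1 = cut, axis2 = color, axis3 = clarity, y = n)) +
  geom_alluvium(aes(fill = high_price)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 2) +
  scale_x_discrete(limits = c("cut", "color", "clarity"))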
We find that when high_price == TRUE:
This time we choose randomForest (https://cran.r-project.org/web/packages/randomForest/index.html) and xgboost (http://xgboost.readthedocs.io/en/latest/) to build models. Both are based on decision trees, but each uses a different optimization strategy to fit the data.
# Randomly assign ~70% of rows to the training set and ~30% to the test set
ind <- sample(2, nrow(diamonds), replace = TRUE, prob = c(0.7, 0.3))
train.set <- diamonds[ind == 1, ]
test.set  <- diamonds[ind == 2, ]

library(randomForest)
# Fit a random forest regression of price on all other variables
rf.model <- randomForest(price ~ ., data = train.set, ntree = 500, importance = TRUE)
plot(rf.model, main = "Trend of training error along with the number of trees")

# Collect the two importance measures (%IncMSE and IncNodePurity) for each feature
imp <- tibble(name = rownames(rf.model$importance),
              MSE = rf.model$importance[, 1],
              nodePurity = rf.model$importance[, 2])
p1 <- imp %>% ggplot(aes(x = name, y = MSE, fill = name)) +
  geom_bar(stat = "identity") + theme(legend.position = "none")
p2 <- imp %>% ggplot(aes(x = name, y = nodePurity, fill = name)) +
  geom_bar(stat = "identity") + theme(legend.position = "none")
gridExtra::grid.arrange(p1, p2)

# Predict on the held-out test set and inspect the residuals
rf.predict <- predict(rf.model, test.set)
plot(c(test.set$price - rf.predict))
mse.rf <- sqrt(mean((test.set$price - rf.predict)^2))  # RMSE on the test set
The RMSE of applying the random forest algorithm is 552.9567994.
library(xgboost)
# Expand the factor columns (cut, color, clarity) into numeric columns, bind them to the
# numeric features, and build the DMatrix objects that xgboost expects
xgb.train <- bind_cols(select_if(train.set, is.numeric),
                       model.matrix(~cut - 1, train.set) %>% as_tibble(),
                       model.matrix(~color - 1, train.set) %>% as_tibble(),
                       model.matrix(~clarity - 1, train.set) %>% as_tibble())
xgboost.train <- xgb.DMatrix(data = as.matrix(select(xgb.train, -price)), label = xgb.train$price)
xgb.test <- bind_cols(select_if(test.set, is.numeric),
                      model.matrix(~cut - 1, test.set) %>% as_tibble(),
                      model.matrix(~color - 1, test.set) %>% as_tibble(),
                      model.matrix(~clarity - 1, test.set) %>% as_tibble())
xgboost.test <- xgb.DMatrix(data = select(xgb.test, -price) %>% as.matrix(), label = xgb.test$price)
The code above was quite a data-cleaning process; to get your data ready for xgboost, that is all you need to do:
# RMSE as the evaluation metric, gamma = minimum loss reduction, max tree depth 6, 3 threads
param <- list(eval_metric = 'rmse', gamma = 1, max_depth = 6, nthread = 3)
# Train for up to 500 rounds, stopping early if the test RMSE does not improve for 60 rounds
xg.model <- xgb.train(data = xgboost.train, params = param, watchlist = list(test = xgboost.test),
                      nrounds = 500, early_stopping_rounds = 60, print_every_n = 30)
## [1] test-rmse:3976.106689
## Will train until test_rmse hasn't improved in 60 rounds.
##
## [31] test-rmse:574.323853
## [61] test-rmse:564.800171
## [91] test-rmse:564.286743
## [121] test-rmse:561.731018
## [151] test-rmse:563.636597
## Stopping. Best iteration:
## [114] test-rmse:561.354980
# Predict on the test DMatrix and compute the test-set RMSE
xg.predict <- predict(xg.model, xgboost.test)
mse.xgb <- sqrt(mean((test.set$price - xg.predict)^2))
plot((test.set$price - xg.predict))  # residuals on the test set
The RMSE of applying the xgboost algorithm is 561.3550006.
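To put the two models side by side, a small summary like the following could be printed, reusing the mse.rf and mse.xgb values computed above (this comparison is my own addition):

# Side-by-side comparison of test-set RMSE for the two models
tibble(model = c("randomForest", "xgboost"),
       test_RMSE = c(mse.rf, mse.xgb)) %>%
  arrange(test_RMSE)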
And that is all there is to it! In this data-mining tutorial, I have shown you:
I hope you enjoy my markdown.