Predicting the Profit Probability of Futures Users with XGBoost
1. Load the data and convert the annualized return into a 0/1 variable
library(xgboost)
library(RMySQL)
library(stringr)
library(caTools)
library(DiagrammeR)
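connectMySQL<-function(mysql,dbname,user,password,host){
drv<-dbDriver(mysql)
# Pass the connection settings by name so they cannot be matched to the wrong argument
return(dbConnect(drv,dbname=dbname,username=user,password=password,host=host))
}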
connect <- function()
{
con <- connectMySQL(mysql = "MySQL", dbname = "BDC_DAM", user = "biadmin", password = "Abcd1234", host = "10.130.2.248")
return(con)
}
con=connect()
dbSendQuery(con,"set names utf8;")
data=dbGetQuery(con,"select * from futures_user_fenbushi where substring(updatetime,1,10)='2016-10-26';")
library(data.table)
data1=as.data.table(data)
colnames(data1)=str_replace_all(colnames(data1),pattern = "/|\\-",replacement = "")
data1[,':='(y=ifelse(year_profit_rate>=0,1,0))]
data1[,':='(year_profit_rate=NULL,updatetime=NULL)]
col=str_detect(colnames(data1),"tag|adv_pre|start|id|end|updatetime|label|year_profit_rate")
# Negate the logical mask; -col would coerce it to 0/-1 and drop only the first column
data1=data1[,!col,with=FALSE]
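The labeling and column-filtering steps can be tried on a toy table without a database connection. This is a minimal sketch in base R, not the author's pipeline: the data frame `toy`, its column names, and its values are all hypothetical.

```r
# Toy stand-in for the query result; all column names and values are hypothetical.
toy <- data.frame(
  user_id          = c(101, 102, 103, 104),
  turnover         = c(1.2, 0.4, 3.1, 0.9),
  year_profit_rate = c(0.15, -0.02, 0, -0.30),
  updatetime       = rep("2016-10-26", 4)
)

# Label: 1 when the annualized return is non-negative, 0 otherwise.
toy$y <- as.integer(toy$year_profit_rate >= 0)

# Logical mask of the columns to drop, matched by name pattern.
drop <- grepl("id|updatetime|year_profit_rate", colnames(toy))

# Negate the mask with `!`. Unary minus on a logical vector coerces it
# to 0 / -1, which would remove only the first column.
features <- toy[, !drop, drop = FALSE]
colnames(features)
```

Only `turnover` and the label `y` survive the filter here, and the label stays in the last column, which is what the later matrix-splitting code relies on.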
2. Build the training and test sets
data1=data1[complete.cases(data1),]
# sample.split expects the label vector and returns one logical flag per row
split=sample.split(data1$y,SplitRatio=0.8)
train=data1[split,]
test=data1[!split,]
train_matrix =as.matrix(train)
test_matrix =as.matrix(test)
train_label=train_matrix[,ncol(train_matrix)]
test_label=test_matrix[,ncol(test_matrix)]
dtrain=xgb.DMatrix(data=train_matrix[,-ncol(train_matrix)],label=train_label)
dtest=xgb.DMatrix(data=test_matrix[,-ncol(test_matrix)],label=test_label)
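`sample.split` from caTools splits a label vector while preserving the class ratio. The same idea can be sketched in base R as follows. The helper `stratified_split` and the label vector are hypothetical and are not part of caTools.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical 0/1 labels: 30 profitable users, 70 unprofitable.
y <- c(rep(1, 30), rep(0, 70))

# Draw the requested share of row indices within each class separately,
# so both classes keep the same proportion in train and test.
stratified_split <- function(labels, ratio) {
  in_train <- logical(length(labels))
  for (cls in unique(labels)) {
    idx <- which(labels == cls)
    n_take <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_take)]] <- TRUE
  }
  in_train
}

in_train <- stratified_split(y, 0.8)
table(label = y, train = in_train)
```

The result has one flag per row, so it can index the rows of the data directly, in the way `train=data1[split,]` does.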
3. Define a custom error function, use cross-validation to choose the number of boosting rounds, then build the model
# Custom evaluation metric: root-mean-squared error between predictions and labels
evalerror <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")
err <- sqrt(mean((preds-labels)^2))
return(list(metric = "RMSE", value = err))
}
param=list(objective='reg:logistic',max_depth=15,eta=0.3)
bst1=xgb.cv(params = param,data = dtrain,nrounds=1000,nfold = 20,prediction = TRUE,feval=evalerror,maximize = FALSE)
# colsample_bytree: fraction of columns sampled for each tree
params=list(booster="gbtree",eta=1,gamma=1,max_depth=15,subsample=0.6,colsample_bytree=0.5,objective='reg:logistic',eval_metric=evalerror)
# Build the model, using the round with the lowest cross-validated test error
bst=xgboost(data = train_matrix[,-ncol(train_matrix)],label=train_label,params = params,nrounds = which.min(bst1$evaluation_log$test_RMSE_mean))
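Choosing the number of rounds from a cross-validation log comes down to finding the row with the lowest held-out error. A minimal base-R sketch, where the log values are made up for illustration and the column names `train_mean` and `test_mean` are hypothetical:

```r
# Root-mean-squared error between predicted probabilities and 0/1 labels.
rmse <- function(preds, labels) sqrt(mean((preds - labels)^2))

# Made-up cross-validation log: one row per boosting round.
cv_log <- data.frame(
  iter       = 1:6,
  train_mean = c(0.45, 0.38, 0.31, 0.25, 0.19, 0.14),
  test_mean  = c(0.46, 0.41, 0.39, 0.38, 0.40, 0.43)
)

# Training error keeps falling as rounds are added, so it cannot signal
# overfitting; pick the round where the held-out error is lowest.
best_round <- cv_log$iter[which.min(cv_log$test_mean)]
best_round
```

In this toy log the held-out error bottoms out at round 4 and rises afterwards, while the training error would keep recommending more rounds.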
4. Evaluate the model on the test set
# Predicted values
pred=predict(bst,test_matrix[,-ncol(test_matrix)])
# Model evaluation
library(ROCR)
predd=prediction(pred,test_label)
perf=performance(predd,'tpr','fpr')
plot(perf,colorize=T,print.cutoffs.at=seq(0,1,by=0.1))
auc=unlist(performance(predd,measure = "auc")@y.values)
#auc
cutoff=performance(predd,measure = "prbe")
#cutoff@y.values
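For intuition about the AUC value, it can be computed by hand as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. A small base-R sketch under that definition; the function `auc_rank` and the scores and labels are hypothetical and independent of ROCR:

```r
# AUC in its Mann-Whitney form: the share of (positive, negative) pairs
# in which the positive case has the higher score. Ties get average ranks.
auc_rank <- function(scores, labels) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores and labels.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)
labels <- c(1,   1,   0,   1,   0,   0)
auc_rank(scores, labels)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the function returns 8/9.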

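The goal of the model is to find as many users with a high probability of profit as possible, so recall is the metric we care about most. The results below suggest that the model is fairly robust:
- AUC: 0.8468491
- Threshold: 0.759134
- Precision: 0.6962111
- Recall: 0.7837014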
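Precision and recall at a given threshold follow directly from the counts of true positives, false positives, and false negatives. A minimal base-R sketch; the helper `precision_recall` and the toy scores are hypothetical, and only the threshold value is taken from the results above:

```r
# Precision and recall for scores cut at a fixed threshold.
precision_recall <- function(scores, labels, threshold) {
  predicted <- scores >= threshold
  tp <- sum(predicted & labels == 1)   # predicted profitable, truly profitable
  fp <- sum(predicted & labels == 0)   # predicted profitable, actually not
  fn <- sum(!predicted & labels == 1)  # profitable users the model missed
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

# Hypothetical scores and labels.
scores <- c(0.95, 0.85, 0.80, 0.60, 0.40, 0.30)
labels <- c(1, 1, 0, 1, 0, 1)
pr <- precision_recall(scores, labels, threshold = 0.759134)
pr
```

Lowering the threshold generally raises recall at the cost of precision, which is the trade-off to weigh when recall is the priority.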
5. Inspect the important variables
model <- xgb.dump(bst, with_stats = T)
names <- dimnames(train_matrix[,-ncol(train_matrix)])[[2]]
importance_matrix <- xgb.importance(names, model = bst)
xgb.plot.importance(importance_matrix[1:10,])

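The plot shows that the ten most important variables are, in order:
- Historical maximum drawdown
- Total profit
- Return after excluding the largest profit
- Turnover rate
- Commission fees
- Number of profitable weeks
- Account lifetime
- Net position / total position
- Average number of instruments held per day
- Number of profitable long trades / number of profitable short trades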