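Predicting the outcome of a Dota match before the ban/pick (BP) phase (a single match can be used as an illustration)
List the model's key factor library (explain how the factors are extracted and why)
How to build the model and determine the weights (explain the method)
How to train and validate (explain the approach)
Data source: https://www.kaggle.com/devinanzelmo/dota-2-matches/downloads/dota-2-matches.zip/3
The dataset consists of 19 files, as follows
First, prepare the data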
library(readr)
ability_ids <- read_csv("/Users/milin/Downloads/dota-2-matches/ability_ids.csv",progress = F)
ability_upgrades <- read_csv("/Users/milin/Downloads/dota-2-matches/ability_upgrades.csv",progress = F)
chat <- read_csv("/Users/milin/Downloads/dota-2-matches/chat.csv",progress = F)
cluster_regions <- read_csv("/Users/milin/Downloads/dota-2-matches/cluster_regions.csv",progress = F)
hero_names <- read_csv("/Users/milin/Downloads/dota-2-matches/hero_names.csv",progress = F)
item_ids <- read_csv("/Users/milin/Downloads/dota-2-matches/item_ids.csv",progress = F)
match_outcomes <- read_csv("/Users/milin/Downloads/dota-2-matches/match_outcomes.csv",progress = F)
match <- read_csv("/Users/milin/Downloads/dota-2-matches/match.csv",progress = F)
objectives <- read_csv("/Users/milin/Downloads/dota-2-matches/objectives.csv",progress = F)
patch_dates <- read_csv("/Users/milin/Downloads/dota-2-matches/patch_dates.csv",progress = F)
player_ratings <- read_csv("/Users/milin/Downloads/dota-2-matches/player_ratings.csv",progress = F)
player_time <- read_csv("/Users/milin/Downloads/dota-2-matches/player_time.csv",progress = F)
players <- read_csv("/Users/milin/Downloads/dota-2-matches/players.csv",progress = F)
purchase_log <- read_csv("/Users/milin/Downloads/dota-2-matches/purchase_log.csv",progress = F)
teamfights_players <- read_csv("/Users/milin/Downloads/dota-2-matches/teamfights_players.csv",progress = F)
teamfights <- read_csv("/Users/milin/Downloads/dota-2-matches/teamfights.csv",progress = F)
test_labels <- read_csv("/Users/milin/Downloads/dota-2-matches/test_labels.csv",progress = F)
test_player <- read_csv("/Users/milin/Downloads/dota-2-matches/test_player.csv",progress = F)
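The test_labels dataset records, for each match, whether Radiant won (1 if it won, 0 otherwise); it covers 100,000 matches in total.
The relevant tables are explained below:
match records information about each match; players records information about each player in each match; player_time records per-player information for every minute; teamfights records team-fight information; player_ratings records information about each player.
The outcome of a match is mainly influenced by three factors.
First, use the historical match data to get a rough picture of how the data is distributed: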
library(ggplot2)
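ggplot(match, aes(x = duration)) +
geom_histogram(aes(y = after_stat(density)), # important: put density, not counts, on the y axis
binwidth = 60, # duration is in seconds, so use one-minute bins
colour = "black", fill = "white") +
geom_density(alpha = .2, fill = "#FF6666") # semi-transparent density curve drawn over the histogram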
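The distribution of match duration is roughly bell-shaped, although the maximum in the summary below points to a long right tail. Next, some summary statistics of the duration: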
summary(match$duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59 2029 2415 2476 2872 16037
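The mean duration is 2476 seconds, about 41 minutes.
To predict whether a team will win, a key question is at which point in time the prediction is made. Generally there are two such points: before the match starts and while it is in progress.
The data available to the model differs between these points. A prediction made before the match can only draw on a small amount of data, such as the players' historical records.
First, build a model that predicts the win probability before the match. It uses player_ratings and match. player_ratings records each player's total number of wins (total_wins) and total number of matches played (total_matches), together with a TrueSkill rating estimate (trueskill_mu) and its uncertainty. Despite the column name `kill` used in the code below, these are skill ratings rather than kill counts. The training data is built by linking match and player_ratings through account_id: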
library(tidyverse)
# player_slot 0-4 is Radiant, 128-132 is Dire
id <- players %>% select(match_id,account_id,player_slot) %>% distinct()
match1 <- match %>% left_join(id,"match_id")
match1 <- match1 %>% dplyr::mutate(side = case_when(player_slot %in% c(0,1,2,3,4) ~ "Radiant",player_slot %in% c(128,129,130,131,132)~"Dire"))
pre_traindata <- match1 %>% left_join(player_ratings,by = "account_id")
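The mapping from player_slot to team used above can also be written without dplyr. The following is a minimal base-R sketch, assuming (as the case_when() call does) that slots 0-4 are Radiant and 128-132 are Dire. `slot_to_side` is a hypothetical helper introduced here for illustration; it is not part of the dataset or of any package.

```r
# Hypothetical helper: map player_slot codes to a team label.
# Assumes slots 0-4 are Radiant and 128-132 are Dire, as in the case_when() call above.
slot_to_side <- function(player_slot) {
  side <- rep(NA_character_, length(player_slot))
  side[player_slot %in% 0:4] <- "Radiant"
  side[player_slot %in% 128:132] <- "Dire"
  side
}

# Any slot outside the two ranges stays NA, which makes unexpected codes easy to spot.
slot_to_side(c(0, 4, 128, 132, 7))
```

Leaving unknown slots as NA, rather than guessing a side, keeps bad rows visible in later aggregation steps.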
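Each team in a match has five players, so the five players' data must be aggregated:
pre_traindata_g <- pre_traindata %>%
  group_by(match_id, side, radiant_win) %>%
  summarise(
    win_rate = mean(total_wins) / mean(total_matches), # ratio of team means, not the mean of player win rates
    kill = mean(trueskill_mu), # mean TrueSkill rating; the name `kill` is kept because later output uses it
    sigma_win = sd(total_wins),
    sigma_matches = sd(total_matches),
    mu_win = mean(total_wins),
    mu_matches = mean(total_matches)
  )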
head(pre_traindata_g,3)
## # A tibble: 3 x 9
## # Groups: match_id, side [3]
## match_id side radiant_win win_rate kill sigma_win sigma_matches mu_win
## <dbl> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 Dire TRUE 0.485 28.7 880939. 1815713. 6.43e5
## 2 0 Radi… TRUE 0.485 24.8 880951. 1815730. 6.43e5
## 3 1 Dire FALSE 0.485 26.9 880935. 1815708. 9.65e5
## # … with 1 more variable: mu_matches <dbl>
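The near-identical win_rate of 0.485 and the very large mu_win values in this output suggest that one account with enormous totals dominates many team averages. It may be a placeholder ID shared by anonymous players, but that is an assumption and is not confirmed by anything quoted in this document. The sketch below uses made-up numbers to show how the ratio of means used in summarise() differs from the mean of per-player win rates when one such account is on the team.

```r
# Made-up team of five: four ordinary accounts plus one account with huge totals.
team <- data.frame(
  total_wins    = c(60, 55, 70, 40, 1600000),
  total_matches = c(100, 100, 100, 100, 3300000)
)

# Ratio of means, as in the summarise() call: dominated by the largest account.
ratio_of_means <- mean(team$total_wins) / mean(team$total_matches)

# Mean of per-player win rates: every player contributes equally.
mean_of_ratios <- mean(team$total_wins / team$total_matches)

round(c(ratio_of_means = ratio_of_means, mean_of_ratios = mean_of_ratios), 3)
```

If the dominating account really is a placeholder, filtering it out before aggregation, or averaging per-player rates, would be worth comparing against the current features.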
Ra <- pre_traindata_g %>% filter(side =="Radiant")
Di <- pre_traindata_g %>% filter(side =="Dire")
names(Ra)[4:9] <- paste(names(Ra)[4:9],"Radiant",sep = ".")
names(Di)[4:9] <- paste(names(Di)[4:9],"Dire",sep = ".")
fdata <- cbind(Ra,Di) %>% select(-match_id1,-side1,-radiant_win1,-side)
fdata$radiant_win <- as.factor(as.numeric(fdata$radiant_win))
head(fdata,3)
## # A tibble: 3 x 15
## # Groups: match_id, side [3]
## side match_id radiant_win win_rate.Radiant kill.Radiant sigma_win.Radia…
## <chr> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 Radi… 0 1 0.485 24.8 880951.
## 2 Radi… 1 0 0.485 24.9 880948.
## 3 Radi… 2 0 NA NA NA
## # … with 9 more variables: sigma_matches.Radiant <dbl>,
## # mu_win.Radiant <dbl>, mu_matches.Radiant <dbl>, win_rate.Dire <dbl>,
## # kill.Dire <dbl>, sigma_win.Dire <dbl>, sigma_matches.Dire <dbl>,
## # mu_win.Dire <dbl>, mu_matches.Dire <dbl>
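Because radiant_win is now a factor with levels "0" and "1", converting it back with as.numeric() alone returns the internal level codes (1 and 2) rather than 0 and 1. A small sketch of the safe conversion on a toy vector:

```r
# Toy outcome vector, encoded the same way as fdata$radiant_win.
win <- as.factor(as.numeric(c(TRUE, FALSE, TRUE)))

codes  <- as.numeric(win)                # internal level codes, not the labels
labels <- as.numeric(as.character(win))  # original 0/1 values

rbind(codes = codes, labels = labels)
```

This matters later when labels are passed to evaluation functions that expect 0/1.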
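This completes the data for the pre-match model. It contains the players' historical win rate and number of wins, aggregated as the average over the players on each team. Other information could be added, such as the players' historical kills, gold, and so on. The model is essentially a binary classification model.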
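For reference, the binary model fitted in the code that follows is a logistic regression on WOE-encoded features. A sketch of its form, where $x_j$ is the $j$-th team-level feature, $\mathrm{WOE}_j$ is its binned encoding, and $\beta_j$ are the fitted weights (the symbols are notation introduced here, not taken from the original):

```latex
P(\text{radiant\_win} = 1 \mid x)
  = \frac{1}{1 + \exp\!\Big(-\big(\beta_0 + \sum_{j} \beta_j \,\mathrm{WOE}_j(x_j)\big)\Big)}
```

The weights $\beta_j$ correspond to the Estimate column of the glm summaries printed later.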
library(scorecard)
dt_f = var_filter(fdata[,-c(1,2)], y="radiant_win",iv_limit = 0.1) # compute IV and filter variables
## [INFO] filtering variables ...
names(dt_f) # 8 features plus the target radiant_win remain
## [1] "win_rate.Radiant" "sigma_win.Radiant" "mu_win.Radiant"
## [4] "mu_matches.Radiant" "win_rate.Dire" "sigma_win.Dire"
## [7] "mu_win.Dire" "mu_matches.Dire" "radiant_win"
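To make the IV filter less of a black box, here is a base-R sketch of how WOE and IV can be computed for a feature that has already been binned. It assumes the convention WOE = log(share of wins in the bin / share of losses in the bin); sign conventions differ between references, and `woe_iv` is a hypothetical helper, not the scorecard implementation.

```r
# Hypothetical helper: WOE per bin and total IV for an already-binned feature.
# Convention assumed here: WOE = log(share of wins in bin / share of losses in bin).
woe_iv <- function(bin, y) {
  tab <- table(bin, y)
  pos <- tab[, "1"] / sum(tab[, "1"])  # distribution of wins across bins
  neg <- tab[, "0"] / sum(tab[, "0"])  # distribution of losses across bins
  woe <- log(pos / neg)
  list(woe = woe, iv = sum((pos - neg) * woe))
}

# Toy feature: the "high" bin wins 70 of 100 matches, the "low" bin 30 of 100.
bin <- rep(c("high", "low"), each = 100)
y   <- c(rep(1, 70), rep(0, 30), rep(1, 30), rep(0, 70))
res <- woe_iv(bin, y)
res$iv > 0.1  # would this toy feature pass an iv_limit of 0.1?
```

A feature whose bins all have the same win share gets WOE 0 everywhere and IV 0, which is why a minimum IV works as a screening rule.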
# split the data set
dt_list = split_df(dt_f, y="radiant_win", ratio = 0.6, seed = 30)
label_list = lapply(dt_list, function(x) x$radiant_win)
head(dt_list) # the training set takes 60% of the rows
## $train
## win_rate.Radiant sigma_win.Radiant mu_win.Radiant
## 1: 0.4851776 880951.4843 643364.0
## 2: 0.4851777 880948.3805 643367.4
## 3: 0.4851794 719293.0923 321687.4
## 4: 0.4851818 880935.4181 643381.6
## 5: 0.6271186 28.0749 22.2
## ---
## 29970: NA NA NA
## 29971: NA NA NA
## 29972: NA NA NA
## 29973: 0.4851768 719293.7631 321686.2
## 29974: 0.4851799 880935.6007 965053.6
## mu_matches.Radiant win_rate.Dire sigma_win.Dire mu_win.Dire
## 1: 1326038.0 0.4851814 880938.5 643378.2
## 2: 1326045.0 0.4851796 880935.3 965053.8
## 3: 663027.8 0.4851834 719288.5 321695.6
## 4: 1326062.8 0.4851782 880950.9 965042.4
## 5: 35.4 0.4851770 719293.1 321687.4
## ---
## 29970: NA NA NA NA
## 29971: NA NA NA NA
## 29972: NA 0.4851771 880951.2 965042.2
## 29973: 663028.8 0.4851781 880945.5 643370.6
## 29974: 1989063.4 NA NA NA
## mu_matches.Dire radiant_win
## 1: 1326057.0 1
## 2: 1989065.0 0
## 3: 663039.2 1
## 4: 1989047.2 1
## 5: 663031.0 1
## ---
## 29970: NA 0
## 29971: NA 1
## 29972: 1989051.4 1
## 29973: 1326050.4 1
## 29974: NA 0
##
## $test
## win_rate.Radiant sigma_win.Radiant mu_win.Radiant
## 1: NA NA NA
## 2: NA NA NA
## 3: 0.4851775 880955.04446 965039.4
## 4: 0.4851776 719296.55815 1286718.8
## 5: 0.5205993 47.48473 83.4
## ---
## 20022: 0.4851777 880953.85773 643361.4
## 20023: 0.4851777 880951.11913 643364.4
## 20024: 0.4851762 719296.55815 321681.2
## 20025: 0.4851775 0.00000 1608398.0
## 20026: NA NA NA
## mu_matches.Radiant win_rate.Dire sigma_win.Dire mu_win.Dire
## 1: NA 0.4851775 0.00000 1608398.0
## 2: NA NA NA NA
## 3: 1989044.0 NA NA NA
## 4: 2652057.2 0.4851763 719289.06741 321694.6
## 5: 160.2 0.4851811 880913.32756 643405.8
## ---
## 20022: 1326032.6 0.4851782 880947.92408 965044.6
## 20023: 1326038.8 0.4851759 719291.97428 321689.4
## 20024: 663019.4 0.4548872 25.57733 24.2
## 20025: 3315071.0 NA NA NA
## 20026: NA NA NA NA
## mu_matches.Dire radiant_win
## 1: 3315071.0 0
## 2: NA 0
## 3: NA 0
## 4: 663046.8 0
## 5: 1326114.6 0
## ---
## 20022: 1989051.8 0
## 20023: 663036.6 1
## 20024: 53.2 0
## 20025: NA 0
## 20026: NA 1
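The split above is produced by split_df. As a rough illustration of what a 60/40 split that preserves the win/loss balance looks like, here is a base-R sketch. `stratified_split` is a hypothetical helper written for this example; it is not how scorecard implements split_df.

```r
# Hypothetical helper: stratified train/test split on a binary outcome column.
stratified_split <- function(df, y, ratio = 0.6, seed = 30) {
  set.seed(seed)
  # Sample row indices separately within each outcome class.
  train_idx <- unlist(lapply(split(seq_len(nrow(df)), df[[y]]), function(idx) {
    idx[sample.int(length(idx), size = round(ratio * length(idx)))]
  }))
  list(train = df[train_idx, , drop = FALSE], test = df[-train_idx, , drop = FALSE])
}

# Toy data: 40 losses and 60 wins.
toy <- data.frame(x = rnorm(100), radiant_win = rep(c(0, 1), times = c(40, 60)))
parts <- stratified_split(toy, "radiant_win")
sapply(parts, nrow)
```

Sampling within each class keeps the share of Radiant wins the same in both parts, so metrics on the two parts are comparable.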
Perform WOE binning
bins = woebin(dt_f, y="radiant_win")
## [INFO] creating woe binning ...
Convert the data to WOE form
dt_woe_list = lapply(dt_list, function(x) woebin_ply(x, bins))
## [INFO] converting into woe values ...
## [INFO] converting into woe values ...
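Note that woebin above is fitted on dt_f, which contains both the training and the test rows, so the bin boundaries and WOE values have seen the test labels. One possible alternative is to derive the bins from the training part only and apply them to both parts. The base-R sketch below shows the idea on toy data; the quartile bins and variable names are illustrative assumptions and do not reproduce what scorecard does internally.

```r
set.seed(1)
# Toy data: one numeric feature whose larger values make a win more likely.
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))
is_train <- seq_len(n) <= 600

# 1. Bin boundaries from the training rows only (quartiles here).
breaks <- quantile(x[is_train], probs = seq(0, 1, 0.25))
breaks[c(1, length(breaks))] <- c(-Inf, Inf)  # cover unseen test values

# 2. WOE per bin from the training rows only.
bin_train <- cut(x[is_train], breaks)
pos <- tapply(y[is_train] == 1, bin_train, sum)
neg <- tapply(y[is_train] == 0, bin_train, sum)
woe <- log((pos / sum(pos)) / (neg / sum(neg)))

# 3. Apply the same bins and WOE lookup to every row, train and test alike.
x_woe <- unname(woe[as.character(cut(x, breaks))])
```

With the real data the same effect could probably be had by passing only the training part to the binning step, but that would change the numbers reported in this document.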
Train the model
dt_woe_list$train$radiant_win <- as.factor(dt_woe_list$train$radiant_win)
dt_woe_list$test$radiant_win <- as.factor(dt_woe_list$test$radiant_win)
m1 = glm(radiant_win~ ., family = binomial(), data = dt_woe_list$train)
m_step = step(m1, direction="both", trace = FALSE)
m2 = eval(m_step$call)
summary(m_step)
##
## Call:
## glm(formula = radiant_win ~ win_rate.Radiant_woe + mu_win.Radiant_woe +
## win_rate.Dire_woe + mu_win.Dire_woe, family = binomial(),
## data = dt_woe_list$train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.369 -1.198 1.031 1.141 1.265
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.07533 0.01159 6.501 7.96e-11 ***
## win_rate.Radiant_woe 1.00204 0.16679 6.008 1.88e-09 ***
## mu_win.Radiant_woe 0.35619 0.22625 1.574 0.115412
## win_rate.Dire_woe 0.69011 0.20575 3.354 0.000796 ***
## mu_win.Dire_woe 0.68874 0.20686 3.329 0.000870 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41510 on 29973 degrees of freedom
## Residual deviance: 41371 on 29969 degrees of freedom
## AIC: 41381
##
## Number of Fisher Scoring iterations: 3
Evaluate the model
pred_list = lapply(dt_woe_list, function(x) predict(m2, x, type='response'))
## performance
# if radiant_win is still a factor here, go through as.character() so the labels are 0/1 rather than level codes 1/2
label_list$train <- as.numeric(as.character(label_list$train))
label_list$test <- as.numeric(as.character(label_list$test))
pred_list$test<- as.numeric(pred_list$test)
perf = scorecard::perf_eva(pred = pred_list$train, label = label_list$train,show_plot = c('ks', 'lift', 'gain', 'roc', 'lz', 'pr', 'f1', 'density'),confusion_matrix = F)
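The call above scores the training predictions only; the test predictions are computed but not evaluated. As an illustration of what the two headline metrics measure, here is a base-R sketch of AUC (via the rank-sum identity) and KS (the largest gap between the score distributions of winners and losers). `auc_ks` is a hypothetical helper and makes no claim to match perf_eva's internals.

```r
# Hypothetical helper: AUC and KS from predicted probabilities and 0/1 labels.
auc_ks <- function(pred, label) {
  stopifnot(length(pred) == length(label), all(label %in% c(0, 1)))
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  r <- rank(pred)  # average ranks, so ties are handled
  auc <- (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  # KS: maximum distance between the empirical CDFs of the two classes.
  ks <- max(abs(ecdf(pred[label == 1])(pred) - ecdf(pred[label == 0])(pred)))
  c(auc = auc, ks = ks)
}

# Toy scores: one loser (0.40) is ranked above one winner (0.35).
m <- auc_ks(pred = c(0.10, 0.40, 0.35, 0.80), label = c(0, 0, 1, 1))
m
```

Running the same helper on pred_list$test and label_list$test would give a held-out figure to set beside the training one.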
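In-match models. An in-match model can additionally use live match data, such as gold, last hits, and kills over the first 5 minutes, the first 10 minutes, or the first half hour. Next, build a prediction model from the state of the match at the 30-minute mark: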
player30 <- player_time %>% filter(times == 30*60)
player30 %>% head(3)
## # A tibble: 3 x 32
## match_id times gold_t_0 lh_t_0 xp_t_0 gold_t_1 lh_t_1 xp_t_1 gold_t_2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 1800 8124 20 7747 12590 83 17244 9064
## 2 1 1800 7519 35 7593 17176 227 24824 9017
## 3 2 1800 7286 49 8222 9374 104 11227 10976
## # … with 23 more variables: lh_t_2 <dbl>, xp_t_2 <dbl>, gold_t_3 <dbl>,
## # lh_t_3 <dbl>, xp_t_3 <dbl>, gold_t_4 <dbl>, lh_t_4 <dbl>,
## # xp_t_4 <dbl>, gold_t_128 <dbl>, lh_t_128 <dbl>, xp_t_128 <dbl>,
## # gold_t_129 <dbl>, lh_t_129 <dbl>, xp_t_129 <dbl>, gold_t_130 <dbl>,
## # lh_t_130 <dbl>, xp_t_130 <dbl>, gold_t_131 <dbl>, lh_t_131 <dbl>,
## # xp_t_131 <dbl>, gold_t_132 <dbl>, lh_t_132 <dbl>, xp_t_132 <dbl>
This data shows each slot's gold, experience, and other information at the 30-minute mark.
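The model below uses the ten per-slot gold columns separately. A simpler derived feature would be the team gold difference. The sketch uses the 30-minute gold values of match 0 as they appear in the outputs printed in this document; `gold_diff` is a hypothetical feature name introduced here and is not used by the original models.

```r
# Gold values for match 0 at 1800 s, copied from the outputs printed in this document.
snap <- data.frame(
  match_id = 0,
  gold_t_0 = 8124, gold_t_1 = 12590, gold_t_2 = 9064, gold_t_3 = 14535, gold_t_4 = 15833,
  gold_t_128 = 11090, gold_t_129 = 9210, gold_t_130 = 13087, gold_t_131 = 5495, gold_t_132 = 14136
)

radiant_cols <- paste0("gold_t_", 0:4)
dire_cols    <- paste0("gold_t_", 128:132)

# Hypothetical derived feature: Radiant team gold minus Dire team gold.
snap$gold_diff <- rowSums(snap[radiant_cols]) - rowSums(snap[dire_cols])
snap$gold_diff
```

For this match the difference is positive and radiant_win is TRUE, which is the direction one would expect; a single match proves nothing, of course.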
fdata1 <- fdata %>% left_join(player30,by = "match_id") %>% select(-times)
head(fdata1,3)
## # A tibble: 3 x 45
## # Groups: match_id, side [3]
## side match_id radiant_win win_rate.Radiant kill.Radiant sigma_win.Radia…
## <chr> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 Radi… 0 1 0.485 24.8 880951.
## 2 Radi… 1 0 0.485 24.9 880948.
## 3 Radi… 2 0 NA NA NA
## # … with 39 more variables: sigma_matches.Radiant <dbl>,
## # mu_win.Radiant <dbl>, mu_matches.Radiant <dbl>, win_rate.Dire <dbl>,
## # kill.Dire <dbl>, sigma_win.Dire <dbl>, sigma_matches.Dire <dbl>,
## # mu_win.Dire <dbl>, mu_matches.Dire <dbl>, gold_t_0 <dbl>,
## # lh_t_0 <dbl>, xp_t_0 <dbl>, gold_t_1 <dbl>, lh_t_1 <dbl>,
## # xp_t_1 <dbl>, gold_t_2 <dbl>, lh_t_2 <dbl>, xp_t_2 <dbl>,
## # gold_t_3 <dbl>, lh_t_3 <dbl>, xp_t_3 <dbl>, gold_t_4 <dbl>,
## # lh_t_4 <dbl>, xp_t_4 <dbl>, gold_t_128 <dbl>, lh_t_128 <dbl>,
## # xp_t_128 <dbl>, gold_t_129 <dbl>, lh_t_129 <dbl>, xp_t_129 <dbl>,
## # gold_t_130 <dbl>, lh_t_130 <dbl>, xp_t_130 <dbl>, gold_t_131 <dbl>,
## # lh_t_131 <dbl>, xp_t_131 <dbl>, gold_t_132 <dbl>, lh_t_132 <dbl>,
## # xp_t_132 <dbl>
Start building the model
library(scorecard)
dt_f = var_filter(fdata1[,-c(1,2)], y="radiant_win",iv_limit = 0.1) # compute IV and filter variables
## [INFO] filtering variables ...
names(dt_f) # 28 features plus the target radiant_win remain
## [1] "win_rate.Radiant" "sigma_win.Radiant" "mu_win.Radiant"
## [4] "mu_matches.Radiant" "win_rate.Dire" "sigma_win.Dire"
## [7] "mu_win.Dire" "mu_matches.Dire" "gold_t_0"
## [10] "xp_t_0" "gold_t_1" "xp_t_1"
## [13] "gold_t_2" "xp_t_2" "gold_t_3"
## [16] "xp_t_3" "gold_t_4" "xp_t_4"
## [19] "gold_t_128" "xp_t_128" "gold_t_129"
## [22] "xp_t_129" "gold_t_130" "xp_t_130"
## [25] "gold_t_131" "xp_t_131" "gold_t_132"
## [28] "xp_t_132" "radiant_win"
# split the data set
dt_list = split_df(dt_f, y="radiant_win", ratio = 0.6, seed = 30)
label_list = lapply(dt_list, function(x) x$radiant_win)
head(dt_list) # the training set takes 60% of the rows
## $train
## win_rate.Radiant sigma_win.Radiant mu_win.Radiant
## 1: 0.4851776 880951.4843 643364.0
## 2: 0.4851777 880948.3805 643367.4
## 3: 0.4851794 719293.0923 321687.4
## 4: 0.4851818 880935.4181 643381.6
## 5: 0.6271186 28.0749 22.2
## ---
## 29970: NA NA NA
## 29971: NA NA NA
## 29972: NA NA NA
## 29973: 0.4851768 719293.7631 321686.2
## 29974: 0.4851799 880935.6007 965053.6
## mu_matches.Radiant win_rate.Dire sigma_win.Dire mu_win.Dire
## 1: 1326038.0 0.4851814 880938.5 643378.2
## 2: 1326045.0 0.4851796 880935.3 965053.8
## 3: 663027.8 0.4851834 719288.5 321695.6
## 4: 1326062.8 0.4851782 880950.9 965042.4
## 5: 35.4 0.4851770 719293.1 321687.4
## ---
## 29970: NA NA NA NA
## 29971: NA NA NA NA
## 29972: NA 0.4851771 880951.2 965042.2
## 29973: 663028.8 0.4851781 880945.5 643370.6
## 29974: 1989063.4 NA NA NA
## mu_matches.Dire gold_t_0 xp_t_0 gold_t_1 xp_t_1 gold_t_2 xp_t_2
## 1: 1326057.0 8124 7747 12590 17244 9064 11478
## 2: 1989065.0 7519 7593 17176 24824 9017 8129
## 3: 663039.2 20551 20131 18565 19751 11222 9651
## 4: 1989047.2 NA NA NA NA NA NA
## 5: 663031.0 7858 8399 19397 19934 13342 15405
## ---
## 29970: NA 12488 16070 8806 9627 8263 9550
## 29971: NA 12125 10374 7474 6284 10093 11001
## 29972: 1989051.4 NA NA NA NA NA NA
## 29973: 1326050.4 13364 14018 6564 8236 11859 11699
## 29974: NA 11245 11605 10767 14275 6790 6944
## gold_t_3 xp_t_3 gold_t_4 xp_t_4 gold_t_128 xp_t_128 gold_t_129
## 1: 14535 16479 15833 15888 11090 13848 9210
## 2: 12850 14219 11918 13471 14989 13547 11976
## 3: 9055 8581 17426 15411 8330 8077 7185
## 4: NA NA NA NA NA NA NA
## 5: 15839 14346 10794 10530 7735 8184 13901
## ---
## 29970: 9835 14612 8908 9961 9376 12471 11598
## 29971: 7326 7471 9797 13692 14743 20257 8075
## 29972: NA NA NA NA NA NA NA
## 29973: 13900 15067 7425 7580 8366 10618 7753
## 29974: 8515 10581 10526 13073 12117 16075 8815
## xp_t_129 gold_t_130 xp_t_130 gold_t_131 xp_t_131 gold_t_132
## 1: 10990 13087 15187 5495 6514 14136
## 2: 14782 10707 13468 12820 17403 12299
## 3: 6471 8407 9685 11963 12976 8957
## 4: NA NA NA NA NA NA
## 5: 14316 8872 10510 7883 8208 15383
## ---
## 29970: 16819 8882 12561 11924 13702 9429
## 29971: 9082 13374 13533 8986 10879 6684
## 29972: NA NA NA NA NA NA
## 29973: 8933 9409 12390 9442 9115 7684
## 29974: 10582 9059 10227 18643 17514 12139
## xp_t_132 radiant_win
## 1: 9975 1
## 2: 15694 0
## 3: 8549 1
## 4: NA 1
## 5: 15275 1
## ---
## 29970: 11602 0
## 29971: 6746 1
## 29972: NA 1
## 29973: 9342 1
## 29974: 13786 0
##
## $test
## win_rate.Radiant sigma_win.Radiant mu_win.Radiant
## 1: NA NA NA
## 2: NA NA NA
## 3: 0.4851775 880955.04446 965039.4
## 4: 0.4851776 719296.55815 1286718.8
## 5: 0.5205993 47.48473 83.4
## ---
## 20022: 0.4851777 880953.85773 643361.4
## 20023: 0.4851777 880951.11913 643364.4
## 20024: 0.4851762 719296.55815 321681.2
## 20025: 0.4851775 0.00000 1608398.0
## 20026: NA NA NA
## mu_matches.Radiant win_rate.Dire sigma_win.Dire mu_win.Dire
## 1: NA 0.4851775 0.00000 1608398.0
## 2: NA NA NA NA
## 3: 1989044.0 NA NA NA
## 4: 2652057.2 0.4851763 719289.06741 321694.6
## 5: 160.2 0.4851811 880913.32756 643405.8
## ---
## 20022: 1326032.6 0.4851782 880947.92408 965044.6
## 20023: 1326038.8 0.4851759 719291.97428 321689.4
## 20024: 663019.4 0.4548872 25.57733 24.2
## 20025: 3315071.0 NA NA NA
## 20026: NA NA NA NA
## mu_matches.Dire gold_t_0 xp_t_0 gold_t_1 xp_t_1 gold_t_2 xp_t_2
## 1: 3315071.0 7286 8222 9374 11227 10976 13636
## 2: NA 6931 8209 11483 11666 11242 14862
## 3: NA 12679 11277 10537 12191 7902 9006
## 4: 663046.8 14239 16473 7246 8144 4963 6175
## 5: 1326114.6 13168 15182 17086 17673 7032 7564
## ---
## 20022: 1989051.8 13465 10184 11088 12908 9390 10882
## 20023: 663036.6 15118 14335 13482 16385 13179 14259
## 20024: 53.2 7175 7566 11746 14862 8678 8135
## 20025: NA 14300 17215 9079 12120 6382 7930
## 20026: NA 11200 13284 8385 8216 9780 10511
## gold_t_3 xp_t_3 gold_t_4 xp_t_4 gold_t_128 xp_t_128 gold_t_129
## 1: 8974 9184 5446 6756 12231 14241 8098
## 2: 6827 8220 11178 12975 10588 12852 11219
## 3: 9718 9415 9712 12841 12773 14062 16759
## 4: 12953 15429 7112 6398 10998 12428 14088
## 5: 7683 8181 11233 10838 8056 6693 13003
## ---
## 20022: 8378 9727 7350 8444 6476 7621 5905
## 20023: 8890 9960 20868 22568 7637 8179 9035
## 20024: 13091 14267 11021 13690 8603 10708 14448
## 20025: 13624 16368 11043 13145 14264 13629 8782
## 20026: 12627 14918 12044 14374 10135 10899 11814
## xp_t_129 gold_t_130 xp_t_130 gold_t_131 xp_t_131 gold_t_132
## 1: 9854 10258 12905 8637 9214 6435
## 2: 14497 7531 8880 6620 7089 12975
## 3: 19796 16551 20137 16094 16415 11569
## 4: 15303 12901 15901 9454 11388 10211
## 5: 13724 16206 21295 8683 7994 7253
## ---
## 20022: 9206 10440 12521 11113 12231 6324
## 20023: 9259 6168 4835 10228 9673 10149
## 20024: 14723 10187 12028 11365 9187 19378
## 20025: 11315 7719 9194 7437 8844 16087
## 20026: 15385 10222 9986 10919 12756 9690
## xp_t_132 radiant_win
## 1: 7410 0
## 2: 15194 0
## 3: 11799 0
## 4: 10461 0
## 5: 9097 0
## ---
## 20022: 8229 0
## 20023: 11321 1
## 20024: 21755 0
## 20025: 18117 0
## 20026: 12439 1
Perform WOE binning
bins = woebin(dt_f, y="radiant_win")
## [INFO] creating woe binning ...
Convert the data to WOE form
dt_woe_list = lapply(dt_list, function(x) woebin_ply(x, bins))
## [INFO] converting into woe values ...
## [INFO] converting into woe values ...
Train the model
dt_woe_list$train$radiant_win <- as.factor(dt_woe_list$train$radiant_win)
dt_woe_list$test$radiant_win <- as.factor(dt_woe_list$test$radiant_win)
m1 = glm(radiant_win~ ., family = binomial(), data = dt_woe_list$train)
m_step = step(m1, direction="both", trace = FALSE)
m2 = eval(m_step$call)
summary(m_step)
##
## Call:
## glm(formula = radiant_win ~ win_rate.Radiant_woe + mu_win.Radiant_woe +
## win_rate.Dire_woe + mu_win.Dire_woe + gold_t_0_woe + xp_t_0_woe +
## gold_t_1_woe + xp_t_1_woe + gold_t_2_woe + xp_t_2_woe + gold_t_3_woe +
## xp_t_3_woe + gold_t_4_woe + xp_t_4_woe + gold_t_128_woe +
## xp_t_128_woe + gold_t_129_woe + xp_t_129_woe + gold_t_130_woe +
## xp_t_130_woe + gold_t_131_woe + xp_t_131_woe + gold_t_132_woe +
## xp_t_132_woe, family = binomial(), data = dt_woe_list$train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6187 -0.8256 0.2682 0.8655 2.7140
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.13287 0.01425 9.327 < 2e-16 ***
## win_rate.Radiant_woe 0.52700 0.20305 2.595 0.009446 **
## mu_win.Radiant_woe -0.45375 0.27323 -1.661 0.096777 .
## win_rate.Dire_woe 0.98673 0.24986 3.949 7.85e-05 ***
## mu_win.Dire_woe 0.58683 0.25015 2.346 0.018982 *
## gold_t_0_woe 0.51334 0.04706 10.908 < 2e-16 ***
## xp_t_0_woe 0.19572 0.05296 3.696 0.000219 ***
## gold_t_1_woe 0.55667 0.04899 11.364 < 2e-16 ***
## xp_t_1_woe 0.19131 0.05529 3.460 0.000540 ***
## gold_t_2_woe 0.58219 0.05052 11.525 < 2e-16 ***
## xp_t_2_woe 0.14386 0.05674 2.535 0.011236 *
## gold_t_3_woe 0.45118 0.04630 9.744 < 2e-16 ***
## xp_t_3_woe 0.24236 0.05215 4.647 3.36e-06 ***
## gold_t_4_woe 0.64393 0.04766 13.512 < 2e-16 ***
## xp_t_4_woe 0.09584 0.05439 1.762 0.078037 .
## gold_t_128_woe 0.57828 0.04617 12.526 < 2e-16 ***
## xp_t_128_woe 0.16309 0.05060 3.223 0.001268 **
## gold_t_129_woe 0.51772 0.04984 10.387 < 2e-16 ***
## xp_t_129_woe 0.23303 0.05293 4.403 1.07e-05 ***
## gold_t_130_woe 0.45566 0.04905 9.289 < 2e-16 ***
## xp_t_130_woe 0.30801 0.05168 5.960 2.52e-09 ***
## gold_t_131_woe 0.45024 0.04701 9.577 < 2e-16 ***
## xp_t_131_woe 0.25150 0.05073 4.958 7.14e-07 ***
## gold_t_132_woe 0.57351 0.04877 11.759 < 2e-16 ***
## xp_t_132_woe 0.10997 0.05335 2.061 0.039286 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41510 on 29973 degrees of freedom
## Residual deviance: 30464 on 29949 degrees of freedom
## AIC: 30514
##
## Number of Fisher Scoring iterations: 5
Evaluate the model
pred_list = lapply(dt_woe_list, function(x) predict(m2, x, type='response'))
## performance
# if radiant_win is still a factor here, go through as.character() so the labels are 0/1 rather than level codes 1/2
label_list$train <- as.numeric(as.character(label_list$train))
label_list$test <- as.numeric(as.character(label_list$test))
pred_list$test<- as.numeric(pred_list$test)
perf = scorecard::perf_eva(pred = pred_list$train, label = label_list$train,show_plot = c('ks', 'lift', 'gain', 'roc', 'lz', 'pr', 'f1', 'density'),confusion_matrix = T)
## [INFO] The threshold of confusion matrix is 0.3917.
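The reported threshold turns predicted probabilities into win/lose calls. How perf_eva chooses and applies it is not shown here, so the following is only a generic sketch of a confusion matrix at a given cut-off, using toy predictions. `confusion_at` is a hypothetical helper.

```r
# Hypothetical helper: confusion matrix for a probability threshold.
confusion_at <- function(pred, label, threshold) {
  predicted <- factor(as.integer(pred >= threshold), levels = c(0, 1))
  actual <- factor(label, levels = c(0, 1))
  table(actual = actual, predicted = predicted)
}

# Toy predictions; the threshold value is the one reported above.
pred  <- c(0.10, 0.35, 0.40, 0.80, 0.60)
label <- c(0, 0, 1, 1, 0)
cm <- confusion_at(pred, label, threshold = 0.3917)
cm
sum(diag(cm)) / sum(cm)  # accuracy at this threshold
```

Fixing the factor levels to 0 and 1 keeps the table 2 x 2 even when every prediction falls on one side of the threshold.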
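The in-match model is clearly more accurate than the pre-match model: on the training set, KS reaches 0.55 and AUC is 0.84. So, broadly speaking, half an hour into a match the data is enough to predict the result fairly well. Next, move the time point earlier and predict from the 10-minute data: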
player10 <- player_time %>% filter(times == 10*60)
player10 %>% head(3)
## # A tibble: 3 x 32
## match_id times gold_t_0 lh_t_0 xp_t_0 gold_t_1 lh_t_1 xp_t_1 gold_t_2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 600 2211 3 1532 3379 39 3903 1650
## 2 1 600 1560 4 1393 3749 57 4065 2453
## 3 2 600 2561 17 2460 2380 23 3033 2869
## # … with 23 more variables: lh_t_2 <dbl>, xp_t_2 <dbl>, gold_t_3 <dbl>,
## # lh_t_3 <dbl>, xp_t_3 <dbl>, gold_t_4 <dbl>, lh_t_4 <dbl>,
## # xp_t_4 <dbl>, gold_t_128 <dbl>, lh_t_128 <dbl>, xp_t_128 <dbl>,
## # gold_t_129 <dbl>, lh_t_129 <dbl>, xp_t_129 <dbl>, gold_t_130 <dbl>,
## # lh_t_130 <dbl>, xp_t_130 <dbl>, gold_t_131 <dbl>, lh_t_131 <dbl>,
## # xp_t_131 <dbl>, gold_t_132 <dbl>, lh_t_132 <dbl>, xp_t_132 <dbl>
This data shows each slot's gold, experience, and other information at the 10-minute mark.
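The 30-minute and 10-minute snapshots are built with the same filter, so the step can be wrapped in a small function and reused for any checkpoint. The sketch below runs on a toy stand-in for player_time; `snapshot_at` and the toy values are illustrative assumptions.

```r
# Hypothetical helper: keep the rows of a player_time-like table at a given minute.
snapshot_at <- function(player_time, minute) {
  player_time[player_time$times == minute * 60, , drop = FALSE]
}

# Toy stand-in for player_time with two matches; match 1 ends before 30 minutes.
toy_time <- data.frame(
  match_id = c(0, 0, 0, 1, 1),
  times    = c(600, 1200, 1800, 600, 1200),
  gold_t_0 = c(2211, 5000, 8124, 1560, 4000)
)

snaps <- lapply(c(m10 = 10, m30 = 30), function(m) snapshot_at(toy_time, m))
sapply(snaps, nrow)
```

Matches that end before a checkpoint have no row at that time, so after the left join they carry NA in every gold and experience column. This is one likely reason for the all-NA in-match rows visible in the 30-minute training output, and it means later checkpoints are evaluated on longer matches only.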
fdata1 <- fdata %>% left_join(player10,by = "match_id") %>% select(-times)
head(fdata1,3)
## # A tibble: 3 x 45
## # Groups: match_id, side [3]
## side match_id radiant_win win_rate.Radiant kill.Radiant sigma_win.Radia…
## <chr> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 Radi… 0 1 0.485 24.8 880951.
## 2 Radi… 1 0 0.485 24.9 880948.
## 3 Radi… 2 0 NA NA NA
## # … with 39 more variables: sigma_matches.Radiant <dbl>,
## # mu_win.Radiant <dbl>, mu_matches.Radiant <dbl>, win_rate.Dire <dbl>,
## # kill.Dire <dbl>, sigma_win.Dire <dbl>, sigma_matches.Dire <dbl>,
## # mu_win.Dire <dbl>, mu_matches.Dire <dbl>, gold_t_0 <dbl>,
## # lh_t_0 <dbl>, xp_t_0 <dbl>, gold_t_1 <dbl>, lh_t_1 <dbl>,
## # xp_t_1 <dbl>, gold_t_2 <dbl>, lh_t_2 <dbl>, xp_t_2 <dbl>,
## # gold_t_3 <dbl>, lh_t_3 <dbl>, xp_t_3 <dbl>, gold_t_4 <dbl>,
## # lh_t_4 <dbl>, xp_t_4 <dbl>, gold_t_128 <dbl>, lh_t_128 <dbl>,
## # xp_t_128 <dbl>, gold_t_129 <dbl>, lh_t_129 <dbl>, xp_t_129 <dbl>,
## # gold_t_130 <dbl>, lh_t_130 <dbl>, xp_t_130 <dbl>, gold_t_131 <dbl>,
## # lh_t_131 <dbl>, xp_t_131 <dbl>, gold_t_132 <dbl>, lh_t_132 <dbl>,
## # xp_t_132 <dbl>
Start building the model
library(scorecard)
dt_f = var_filter(fdata1[,-c(1,2)], y="radiant_win",iv_limit = 0.1) # compute IV and filter variables
## [INFO] filtering variables ...
names(dt_f) # 28 features plus the target radiant_win remain
## [1] "win_rate.Radiant" "sigma_win.Radiant" "mu_win.Radiant"
## [4] "mu_matches.Radiant" "win_rate.Dire" "sigma_win.Dire"
## [7] "mu_win.Dire" "mu_matches.Dire" "gold_t_0"
## [10] "xp_t_0" "gold_t_1" "xp_t_1"
## [13] "gold_t_2" "xp_t_2" "gold_t_3"
## [16] "xp_t_3" "gold_t_4" "xp_t_4"
## [19] "gold_t_128" "xp_t_128" "gold_t_129"
## [22] "xp_t_129" "gold_t_130" "xp_t_130"
## [25] "gold_t_131" "xp_t_131" "gold_t_132"
## [28] "xp_t_132" "radiant_win"
# split the data set
dt_list = split_df(dt_f, y="radiant_win", ratio = 0.6, seed = 30)
label_list = lapply(dt_list, function(x) x$radiant_win)
head(dt_list) # the training set takes 60% of the rows
## $train
## win_rate.Radiant sigma_win.Radiant mu_win.Radiant
## 1: 0.4851776 880951.4843 643364.0
## 2: 0.4851777 880948.3805 643367.4
## 3: 0.4851794 719293.0923 321687.4
## 4: 0.4851818 880935.4181 643381.6
## 5: 0.6271186 28.0749 22.2
## ---
## 29970: NA NA NA
## 29971: NA NA NA
## 29972: NA NA NA
## 29973: 0.4851768 719293.7631 321686.2
## 29974: 0.4851799 880935.6007 965053.6
## mu_matches.Radiant win_rate.Dire sigma_win.Dire mu_win.Dire
## 1: 1326038.0 0.4851814 880938.5 643378.2
## 2: 1326045.0 0.4851796 880935.3 965053.8
## 3: 663027.8 0.4851834 719288.5 321695.6
## 4: 1326062.8 0.4851782 880950.9 965042.4
## 5: 35.4 0.4851770 719293.1 321687.4
## ---
## 29970: NA NA NA NA
## 29971: NA NA NA NA
## 29972: NA 0.4851771 880951.2 965042.2
## 29973: 663028.8 0.4851781 880945.5 643370.6
## 29974: 1989063.4 NA NA NA
## mu_matches.Dire gold_t_0 xp_t_0 gold_t_1 xp_t_1 gold_t_2 xp_t_2
## 1: 1326057.0 2211 1532 3379 3903 1650 1450
## 2: 1989065.0 1560 1393 3749 4065 2453 1774
## 3: 663039.2 4108 3802 4735 4778 2339 2365
## 4: 1989047.2 2590 2954 1328 1202 3517 3128
## 5: 663031.0 1590 1529 3932 3520 2243 2780
## ---
## 29970: NA 2688 3308 1445 1604 2411 2646
## 29971: NA 2568 1722 1878 1321 2548 3211
## 29972: 1989051.4 3448 3955 4082 4771 3079 2916
## 29973: 1326050.4 3285 4123 1197 1799 2901 2262
## 29974: NA 2788 2828 1994 2341 1590 1980
## gold_t_3 xp_t_3 gold_t_4 xp_t_4 gold_t_128 xp_t_128 gold_t_129
## 1: 2859 4017 3745 3464 2623 3395 2573
## 2: 2811 2207 3748 4364 5015 4095 3286
## 3: 1525 1723 2871 2534 1410 1685 1826
## 4: 4084 4095 1930 1698 3024 3016 1696
## 5: 4298 4148 2277 2035 2025 1525 3531
## ---
## 29970: 2189 2507 3041 3662 2004 2889 1995
## 29971: 1651 1611 2790 3609 3465 5118 1676
## 29972: 2522 2047 2806 2476 4079 4017 3718
## 29973: 2986 3175 1446 1755 2411 4103 1468
## 29974: 2649 3952 2976 3330 3375 3748 2206
## xp_t_129 gold_t_130 xp_t_130 gold_t_131 xp_t_131 gold_t_132
## 1: 3295 3853 4396 1058 315 4164
## 2: 3182 1741 2035 2869 3085 2991
## 3: 1752 2990 2778 3428 3108 1781
## 4: 1718 2440 2247 1401 1461 3398
## 5: 3440 2654 2578 2900 2325 3900
## ---
## 29970: 2580 2163 2098 2918 3697 1616
## 29971: 1580 3320 2547 2003 3170 1183
## 29972: 2828 1511 1447 3074 3318 2207
## 29973: 1513 2177 3310 2408 2145 1331
## 29974: 2444 2722 3020 4609 4428 2669
## xp_t_132 radiant_win
## 1: 2124 1
## 2: 3320 0
## 3: 1783 1
## 4: 3935 1
## 5: 3068 1
## ---
## 29970: 2372 0
## 29971: 1324 1
## 29972: 1765 1
## 29973: 1801 1
## 29974: 2740 0
##
## $test
## win_rate.Radiant sigma_win.Radiant mu_win.Radiant
## 1: NA NA NA
## 2: NA NA NA
## 3: 0.4851775 880955.04446 965039.4
## 4: 0.4851776 719296.55815 1286718.8
## 5: 0.5205993 47.48473 83.4
## ---
## 20022: 0.4851777 880953.85773 643361.4
## 20023: 0.4851777 880951.11913 643364.4
## 20024: 0.4851762 719296.55815 321681.2
## 20025: 0.4851775 0.00000 1608398.0
## 20026: NA NA NA
## mu_matches.Radiant win_rate.Dire sigma_win.Dire mu_win.Dire
## 1: NA 0.4851775 0.00000 1608398.0
## 2: NA NA NA NA
## 3: 1989044.0 NA NA NA
## 4: 2652057.2 0.4851763 719289.06741 321694.6
## 5: 160.2 0.4851811 880913.32756 643405.8
## ---
## 20022: 1326032.6 0.4851782 880947.92408 965044.6
## 20023: 1326038.8 0.4851759 719291.97428 321689.4
## 20024: 663019.4 0.4548872 25.57733 24.2
## 20025: 3315071.0 NA NA NA
## 20026: NA NA NA NA
## mu_matches.Dire gold_t_0 xp_t_0 gold_t_1 xp_t_1 gold_t_2 xp_t_2
## 1: 3315071.0 2561 2460 2380 3033 2869 3230
## 2: NA 1745 2100 2780 1935 1741 2781
## 3: NA 3511 2872 2518 2231 1898 2097
## 4: 663046.8 3164 3540 1306 1343 1164 1222
## 5: 1326114.6 3427 4228 4708 4695 1735 1777
## ---
## 20022: 1989051.8 2625 1977 2713 2789 2139 2955
## 20023: 663036.6 2850 2882 2986 4203 3133 3196
## 20024: 53.2 1448 1353 2548 3354 1768 1299
## 20025: NA 3768 3699 1994 1722 1824 2227
## 20026: NA 3777 3684 2272 2702 2848 2985
## gold_t_3 xp_t_3 gold_t_4 xp_t_4 gold_t_128 xp_t_128 gold_t_129
## 1: 2033 2172 1044 1560 3448 3088 1992
## 2: 1839 1848 2689 3721 1619 1539 2820
## 3: 2940 3690 1260 1870 1683 1706 2286
## 4: 2680 3186 2468 2119 2975 3559 3245
## 5: 1331 1412 3757 2321 2190 1688 3454
## ---
## 20022: 2011 2552 1625 2011 1924 2033 1111
## 20023: 1350 1469 3616 4510 2657 2264 2841
## 20024: 2949 2812 2715 3532 1530 1884 3637
## 20025: 2466 3298 2259 2578 3195 4067 2358
## 20026: 3714 3958 2957 2884 2228 2099 2624
## xp_t_129 gold_t_130 xp_t_130 gold_t_131 xp_t_131 gold_t_132
## 1: 2529 3559 4642 1974 1786 1120
## 2: 3590 1136 1947 1178 1367 4242
## 3: 2605 3383 4211 2287 2069 1462
## 4: 2556 3412 4338 2043 3242 2622
## 5: 3653 2617 3410 1299 1135 1734
## ---
## 20022: 1633 1752 2315 2998 4043 1267
## 20023: 3323 2310 1461 3133 2653 3184
## 20024: 4117 2432 2888 2252 2204 3664
## 20025: 2898 2034 1769 1935 1854 3502
## 20026: 3309 2693 2371 2134 2082 2396
## xp_t_132 radiant_win
## 1: 1524 0
## 2: 3676 0
## 3: 1975 0
## 4: 2425 0
## 5: 2216 0
## ---
## 20022: 1950 0
## 20023: 3699 1
## 20024: 3572 0
## 20025: 3658 0
## 20026: 3335 1
Perform WOE binning
bins = woebin(dt_f, y="radiant_win")
## [INFO] creating woe binning ...
Convert the data to WOE form
dt_woe_list = lapply(dt_list, function(x) woebin_ply(x, bins))
## [INFO] converting into woe values ...
## [INFO] converting into woe values ...
Train the model
dt_woe_list$train$radiant_win <- as.factor(dt_woe_list$train$radiant_win)
dt_woe_list$test$radiant_win <- as.factor(dt_woe_list$test$radiant_win)
m1 = glm(radiant_win~ ., family = binomial(), data = dt_woe_list$train)
m_step = step(m1, direction="both", trace = FALSE)
m2 = eval(m_step$call)
summary(m_step)
##
## Call:
## glm(formula = radiant_win ~ win_rate.Radiant_woe + win_rate.Dire_woe +
## gold_t_0_woe + xp_t_0_woe + gold_t_1_woe + xp_t_1_woe + gold_t_2_woe +
## xp_t_2_woe + gold_t_3_woe + xp_t_3_woe + gold_t_4_woe + xp_t_4_woe +
## gold_t_128_woe + xp_t_128_woe + gold_t_129_woe + xp_t_129_woe +
## gold_t_130_woe + xp_t_130_woe + gold_t_131_woe + xp_t_131_woe +
## gold_t_132_woe + xp_t_132_woe, family = binomial(), data = dt_woe_list$train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5634 -1.0815 0.5589 1.0456 2.5391
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.07416 0.01239 5.986 2.16e-09 ***
## win_rate.Radiant_woe 0.92490 0.15003 6.165 7.06e-10 ***
## win_rate.Dire_woe 0.80654 0.15385 5.242 1.59e-07 ***
## gold_t_0_woe 1.05940 0.07883 13.439 < 2e-16 ***
## xp_t_0_woe 0.53116 0.10441 5.087 3.63e-07 ***
## gold_t_1_woe 1.13579 0.08346 13.609 < 2e-16 ***
## xp_t_1_woe 0.55609 0.11313 4.915 8.86e-07 ***
## gold_t_2_woe 1.03457 0.08585 12.051 < 2e-16 ***
## xp_t_2_woe 0.67513 0.11087 6.089 1.13e-09 ***
## gold_t_3_woe 0.92653 0.08309 11.151 < 2e-16 ***
## xp_t_3_woe 0.64337 0.10425 6.172 6.76e-10 ***
## gold_t_4_woe 1.07672 0.08414 12.796 < 2e-16 ***
## xp_t_4_woe 0.50889 0.10364 4.910 9.10e-07 ***
## gold_t_128_woe 0.98453 0.07465 13.188 < 2e-16 ***
## xp_t_128_woe 0.65225 0.09427 6.919 4.56e-12 ***
## gold_t_129_woe 1.13322 0.08119 13.957 < 2e-16 ***
## xp_t_129_woe 0.44481 0.11304 3.935 8.32e-05 ***
## gold_t_130_woe 0.93795 0.08254 11.363 < 2e-16 ***
## xp_t_130_woe 0.72167 0.09541 7.564 3.91e-14 ***
## gold_t_131_woe 0.91588 0.08632 10.610 < 2e-16 ***
## xp_t_131_woe 0.68305 0.10549 6.475 9.49e-11 ***
## gold_t_132_woe 1.00633 0.08399 11.982 < 2e-16 ***
## xp_t_132_woe 0.61879 0.10536 5.873 4.27e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41510 on 29973 degrees of freedom
## Residual deviance: 37430 on 29951 degrees of freedom
## AIC: 37476
##
## Number of Fisher Scoring iterations: 4
Evaluate the model
pred_list = lapply(dt_woe_list, function(x) predict(m2, x, type='response'))
## performance
# if radiant_win is still a factor here, go through as.character() so the labels are 0/1 rather than level codes 1/2
label_list$train <- as.numeric(as.character(label_list$train))
label_list$test <- as.numeric(as.character(label_list$test))
pred_list$test<- as.numeric(pred_list$test)
perf = scorecard::perf_eva(pred = pred_list$train, label = label_list$train,show_plot = c('ks', 'lift', 'gain', 'roc', 'lz', 'pr', 'f1', 'density'),confusion_matrix = T)
## [INFO] The threshold of confusion matrix is 0.3441.
The in-match model based on the first 10 minutes predicts somewhat worse: KS is 0.30 and AUC is 0.7044.
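The three glm summaries can also be compared through the share of null deviance each model removes. For a 0/1 outcome this equals McFadden's pseudo R-squared. The sketch below only does arithmetic on the deviances printed in the summaries above; the model labels are names chosen here.

```r
# Null and residual deviances copied from the three summary() outputs above.
dev <- data.frame(
  model    = c("pre_match", "minute_30", "minute_10"),
  null     = c(41510, 41510, 41510),
  residual = c(41371, 30464, 37430)
)

# Share of the null deviance explained by each model.
dev$explained <- 1 - dev$residual / dev$null
dev[order(dev$explained), ]
```

The ordering agrees with the KS and AUC figures: the pre-match features explain very little, and the 30-minute snapshot explains the most. All three figures are training-set quantities.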
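There are two kinds of prediction model:
1. Prediction before the match. The data available at this point is the players' history, such as historical win rate, average kills, average gold earned, average deaths, and average assists. A pre-match prediction can only use such historical data.
2. Prediction while the match is in progress. This requires choosing a time point. Given the duration distribution (the mean match lasts about 41 minutes), the 10-minute mark is one possible choice. The match state at 10 minutes supplies the features, which can include each slot's gold, experience, kills, deaths, assists, and so on.
Once the data is available, the relationship between each feature and the outcome needs to be assessed: some features have predictive power and some do not. Features with predictive power include gold and experience, which is easy to understand because the side with more experience and gold has the advantage. Kill counts are another candidate. There are several ways to screen features; this article uses the IV (information value). Other metrics exist, and machine-learning methods can also be used for feature selection.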
Split the historical match data into two parts: one for building the model and one for testing it.
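The train/test idea can be shown end to end on simulated data. This is a minimal sketch, assuming a single made-up feature and a plain logistic regression; it does not use the Dota tables, and the variable names are illustrative.

```r
set.seed(30)
# Toy data: one feature (think of a team gold difference) that drives the outcome.
n <- 2000
toy <- data.frame(gold_diff = rnorm(n))
toy$radiant_win <- rbinom(n, 1, plogis(2 * toy$gold_diff))

# Hold out 40% of the rows for testing.
train_rows <- sample.int(n, size = 0.6 * n)
train <- toy[train_rows, ]
test  <- toy[-train_rows, ]

# Fit on the training part only, then score both parts.
fit <- glm(radiant_win ~ gold_diff, family = binomial(), data = train)
accuracy <- function(d) {
  mean((predict(fit, d, type = "response") >= 0.5) == (d$radiant_win == 1))
}
acc <- c(train = accuracy(train), test = accuracy(test))
acc
```

The test figure is the one to report, since the training figure is measured on data the model has already seen.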
Key points: