You know Dota, right? DotA stands for Defense of the Ancients. Dota 2 is a multiplayer online battle arena (MOBA) video game developed and published by Valve. Dota 2 is played in matches between two teams (called Radiant and Dire) of five players, with each team occupying and defending its own base on the map. Each of the ten players independently controls a powerful character, known as a hero. A team wins by being the first to destroy the other team’s Ancient, a large structure located within their base.
This project is based on an old Kaggle competition. The training set consists of matches for which all in-game events (kills, item purchases, etc.) as well as match outcomes are known. We are given only the first 5 minutes of each match and need to predict the likelihood of a Radiant victory. You can find all the datasets and the competition here. As you can see from the Kaggle page, there are only 2 notebooks and they don’t tell much, so we’re going to build the prediction from scratch.
You can load the required packages into your workspace using the library() function.
Crystal Maiden image by: chroneco
Kaggle provides 5 datasets: test, train, hero_names, item_ids, and a submission example. Since I’m not going to submit my predictions to the closed competition, I don’t need to load the submission file. The train dataset contains 409,794 observations and 103 variables (one of them the target label), so we have 102 variables to predict which team wins. The test dataset contains 71,772 observations, and I don’t need that either because I’m not submitting my results.
## Observations: 409,794
## Variables: 103
## $ match_id <int> 1956782143, 1956782359, 1956782465, 195...
## $ r1_hero <int> 106, 112, 46, 35, 68, 62, 8, 53, 7, 12,...
## $ r1_level <int> 3, 0, 2, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, ...
## $ r1_xp <int> 2060, 747, 1839, 1681, 445, 953, 1737, ...
## $ r1_gold <int> 1864, 1117, 1477, 1312, 613, 1120, 1439...
## $ r1_lh <int> 28, 0, 13, 14, 3, 2, 22, 7, 19, 27, 9, ...
## $ r1_kills <int> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ r1_deaths <int> 7, 8, 0, 12, 9, 7, 3, 8, 3, 4, 6, 3, 20...
## $ r1_items <int> 6, 13, 10, 8, 6, 11, 12, 4, 8, 10, 9, 8...
## $ r2_hero <int> 8, 50, 6, 75, 107, 13, 80, 11, 14, 21, ...
## $ r2_level <int> 2, 0, 1, 0, 0, 1, 1, 0, 1, 0, 2, 0, 0, ...
## $ r2_xp <int> 1621, 1106, 1121, 1099, 1454, 768, 1471...
## $ r2_gold <int> 1040, 1330, 949, 1394, 980, 714, 1050, ...
## $ r2_lh <int> 20, 2, 11, 11, 2, 5, 13, 10, 6, 1, 9, 1...
## $ r2_kills <int> 0, 3, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ r2_deaths <int> 3, 1, 2, 3, 9, 6, 4, 15, 9, 8, 6, 1, 12...
## $ r2_items <int> 7, 10, 10, 9, 7, 9, 9, 7, 2, 10, 6, 10,...
## $ r3_hero <int> 62, 1, 2, 43, 65, 11, 25, 55, 73, 11, 7...
## $ r3_level <int> 1, 0, 0, 0, 0, 1, 0, 0, 2, 0, 2, 0, 0, ...
## $ r3_xp <int> 642, 1344, 1705, 1210, 763, 1877, 1834,...
## $ r3_gold <int> 1297, 1707, 1911, 1269, 819, 1752, 1319...
## $ r3_lh <int> 1, 21, 20, 6, 10, 15, 17, 12, 13, 19, 1...
## $ r3_kills <int> 2, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ r3_deaths <int> 8, 3, 2, 1, 8, 3, 10, 9, 7, 5, 11, 1, 9...
## $ r3_items <int> 13, 7, 4, 9, 5, 8, 9, 9, 3, 9, 10, 6, 8...
## $ r4_hero <int> 95, 11, 45, 83, 81, 99, 7, 26, 59, 51, ...
## $ r4_level <int> 2, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, ...
## $ r4_xp <int> 818, 2217, 946, 1096, 1920, 1340, 433, ...
## $ r4_gold <int> 960, 1867, 984, 671, 1788, 1320, 746, 5...
## $ r4_lh <int> 14, 25, 3, 0, 29, 20, 0, 1, 16, 24, 1, ...
## $ r4_kills <int> 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...
## $ r4_deaths <int> 10, 3, 2, 8, 9, 7, 6, 15, 6, 9, 12, 5, ...
## $ r4_items <int> 8, 7, 10, 7, 9, 8, 15, 10, 7, 12, 7, 6,...
## $ r5_hero <int> 6, 69, 32, 97, 21, 18, 50, 21, 25, 50, ...
## $ r5_level <int> 2, 0, 1, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, ...
## $ r5_xp <int> 1678, 1810, 876, 994, 1875, 1994, 569, ...
## $ r5_gold <int> 1391, 904, 726, 678, 1256, 1778, 1048, ...
## $ r5_lh <int> 16, 8, 4, 3, 16, 30, 0, 13, 7, 0, 14, 1...
## $ r5_kills <int> 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
## $ r5_deaths <int> 9, 1, 3, 7, 6, 2, 7, 6, 5, 2, 8, 11, 7,...
## $ r5_items <int> 6, 5, 5, 6, 7, 9, 10, 8, 5, 12, 8, 10, ...
## $ d1_hero <int> 37, 25, 7, 50, 11, 28, 59, 28, 35, 73, ...
## $ d1_level <int> 2, 0, 1, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, ...
## $ d1_xp <int> 1686, 1671, 1115, 1027, 2200, 1181, 186...
## $ d1_gold <int> 1282, 1109, 1130, 952, 1677, 1280, 1613...
## $ d1_lh <int> 16, 15, 9, 2, 25, 14, 24, 10, 20, 19, 1...
## $ d1_kills <int> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 1, ...
## $ d1_deaths <int> 8, 5, 5, 4, 5, 10, 1, 12, 4, 5, 5, 0, 7...
## $ d1_items <int> 9, 8, 7, 12, 8, 9, 6, 8, 7, 8, 6, 6, 7,...
## $ d2_hero <int> 49, 30, 106, 11, 57, 90, 87, 20, 56, 39...
## $ d2_level <int> 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, ...
## $ d2_xp <int> 1141, 963, 893, 1831, 614, 924, 712, 11...
## $ d2_gold <int> 1078, 500, 1066, 1386, 500, 907, 887, 1...
## $ d2_lh <int> 14, 0, 9, 16, 0, 4, 1, 3, 7, 6, 9, 13, ...
## $ d2_kills <int> 0, 0, 0, 1, 0, 0, 1, 1, 1, 2, 0, 0, 2, ...
## $ d2_deaths <int> 5, 6, 6, 0, 6, 4, 2, 8, 10, 5, 4, 1, 4,...
## $ d2_items <int> 4, 7, 6, 7, 7, 8, 12, 9, 11, 9, 5, 9, 1...
## $ d3_hero <int> 111, 27, 39, 26, 28, 100, 38, 81, 74, 6...
## $ d3_level <int> 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 2, 0, 0, ...
## $ d3_xp <int> 963, 414, 1661, 1015, 1153, 1749, 366, ...
## $ d3_gold <int> 641, 562, 935, 1268, 971, 1421, 1224, 1...
## $ d3_lh <int> 1, 1, 10, 0, 10, 12, 0, 19, 10, 2, 13, ...
## $ d3_kills <int> 0, 0, 0, 2, 0, 0, 2, 1, 0, 0, 0, 0, 1, ...
## $ d3_deaths <int> 10, 6, 1, 1, 4, 9, 5, 7, 5, 10, 1, 10, ...
## $ d3_items <int> 10, 9, 5, 12, 9, 7, 7, 14, 8, 11, 6, 6,...
## $ d4_hero <int> 98, 106, 50, 67, 86, 53, 69, 8, 30, 100...
## $ d4_level <int> 3, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, ...
## $ d4_xp <int> 2071, 1237, 1111, 855, 487, 898, 1903, ...
## $ d4_gold <int> 1377, 1802, 690, 1032, 500, 1035, 1474,...
## $ d4_lh <int> 19, 31, 4, 11, 0, 13, 17, 9, 3, 6, 2, 1...
## $ d4_kills <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ d4_deaths <int> 11, 3, 9, 0, 12, 1, 4, 3, 8, 12, 7, 11,...
## $ d4_items <int> 6, 10, 10, 5, 9, 5, 9, 7, 4, 4, 10, 10,...
## $ d5_hero <int> 83, 2, 73, 71, 8, 21, 21, 57, 71, 87, 2...
## $ d5_level <int> 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, ...
## $ d5_xp <int> 293, 1487, 1002, 1536, 1122, 1728, 1768...
## $ d5_gold <int> 635, 1350, 1905, 1379, 1924, 1264, 1709...
## $ d5_lh <int> 3, 17, 11, 14, 34, 16, 16, 4, 8, 3, 9, ...
## $ d5_kills <int> 0, 0, 0, 1, 0, 0, 2, 2, 0, 0, 0, 0, 1, ...
## $ d5_deaths <int> 7, 12, 5, 1, 5, 5, 3, 1, 1, 8, 5, 0, 5,...
## $ d5_items <int> 7, 8, 5, 7, 10, 8, 9, 8, 4, 10, 3, 6, 9...
## $ first_blood_time <dbl> 12, -14, 93, 4, 130, 149, 18, 97, 214, ...
## $ first_blood_team <dbl> 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, ...
## $ first_blood_player1 <dbl> 2, 3, 3, 7, 1, 2, 7, 8, 6, 6, 5, 0, 6, ...
## $ first_blood_player2 <dbl> 7, 6, 6, 3, 8, 9, 2, 4, 2, 3, 3, 5, 4, ...
## $ radiant_bottle_time <dbl> 109, 70, 161, -32, 190, 151, 238, 191, ...
## $ radiant_courier_time <dbl> -64, -77, -69, -69, -84, -71, -85, -86,...
## $ radiant_flying_courier_time <dbl> 223, 184, 204, 202, 272, NA, 269, 224, ...
## $ radiant_tpscroll_count <int> 1, 3, 3, 5, 0, 2, 6, 5, 2, 7, 2, 0, 4, ...
## $ radiant_boots_count <int> 3, 3, 2, 3, 4, 5, 3, 2, 4, 3, 4, 4, 2, ...
## $ radiant_ward_observer_count <int> 2, 2, 3, 1, 2, 2, 2, 3, 1, 4, 1, 2, 1, ...
## $ radiant_ward_sentry_count <int> 0, 1, 0, 0, 1, 1, 1, 1, 0, 2, 0, 0, 0, ...
## $ radiant_first_ward_time <dbl> 61, 177, -25, -6, 5, -6, -21, -20, -5, ...
## $ dire_bottle_time <dbl> 144, 135, 48, 125, 97, 154, 162, NA, NA...
## $ dire_courier_time <dbl> -71, -81, -72, -87, -84, -68, -44, -75,...
## $ dire_flying_courier_time <dbl> 184, 294, 219, 209, NA, 186, NA, 182, 2...
## $ dire_tpscroll_count <int> 2, 7, 4, 3, 3, 3, 2, 2, 2, 2, 3, 1, 1, ...
## $ dire_boots_count <int> 2, 1, 2, 5, 3, 2, 5, 3, 5, 4, 5, 2, 4, ...
## $ dire_ward_observer_count <int> 2, 3, 2, 3, 2, 2, 3, 2, 1, 3, 1, 2, 2, ...
## $ dire_ward_sentry_count <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, ...
## $ dire_first_ward_time <dbl> 32, -24, -20, 35, -42, -8, 88, 71, NA, ...
## $ duration <int> 4001, 2628, 1721, 1965, 2522, 2936, 157...
## $ radiant_win <fct> True, True, True, False, False, False, ...
If you have no idea what those variables are about, here are their descriptions:
- match_id: unique id for every match
- note: r1-r5 are players 1-5 on team Radiant, and d1-d5 are players 1-5 on team Dire
- r1_hero : player’s hero (mapping can be found in hero_names.json)
- r1_level : maximum hero level reached (by the first 5 minutes of the game)
- r1_xp : maximum experience gained
- r1_gold : amount of gold earned
- r1_lh : last hits, number of creeps killed
- r1_kills : number of players killed
- r1_deaths : the number of deaths for the hero
- r1_items : the number of items purchased
- note: If the “first blood” event did not have time to occur in the first 5 minutes, the column contains a missing value.
- first_blood_time : time for the first blood (first_blood = first death/kill in the game)
- first_blood_team : the team that committed the first blood (0 - Radiant, 1 - Dire)
- first_blood_player1 : index of player who got the kill
- first_blood_player2 : index of player who got killed
- note: Features for both teams (prefixes radiant_ and dire_ ):
- radiant_bottle_time : the time the team first purchased the item “bottle”
- radiant_courier_time : acquisition time of the “courier” item
- radiant_flying_courier_time : acquisition time of the “flying_courier” item
- radiant_tpscroll_count : the number of tpscroll items bought in the first 5 minutes
- radiant_boots_count : the number of “boots” for the team in the first 5 minutes
- radiant_ward_observer_count : the number of “ward_observer” items
- radiant_ward_sentry_count : the number of “ward_sentry” items
- radiant_first_ward_time : the time for the first placed ward for the team
- radiant_win : True if the Radiant team won, False otherwise (this is our target variable)
There’s a lot of data wrangling to do. As you can see from the glimpse above, lots of variables seem redundant. We’ll reduce them to as few as possible without losing their meaning.
# i found 1 row where the first_blood_player2 column contains -1. maybe it's a typo. i'll change it to 0
train1$first_blood_player2[train1$first_blood_player2 == -1] <- 0
i <- c(1,2,9,10,17,18,25,26,33,34,41,42,49,50,57,58,65,66,73,74,81,83,84,85)
train1.edit <- train1
# change some columns to factor
train1.edit[,i] <- lapply(train1.edit[,i], as.factor)
# i also found missing values in the target variable (only 8 rows). I remove those rows and drop the unused factor level
train1.edit <- train1.edit[!(train1.edit$radiant_win==""),] %>% droplevels()
# check the proportion of the target variable. make sure it's evenly distributed
table(train1.edit$radiant_win) %>% prop.table()
##
## False True
## 0.4823835 0.5176165
It’s good to go.
The dataset is way too large for my potato laptop, so I’ll sample it down to only 4k observations.
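The sampling step itself is not shown in this post. A minimal sketch of one way to do it in base R, assuming plain random sampling without replacement; the names `toy`, `toy_small`, `n_keep`, and the seed value are made up for illustration and are not from the original notebook:

```r
# Toy stand-in for the full training table (the real one has 409,794 rows).
set.seed(42)  # hypothetical seed, only so the sample is reproducible
toy <- data.frame(
  match_id    = 1:20000,
  radiant_win = factor(sample(c("False", "True"), 20000, replace = TRUE))
)

# Draw 4,000 row indices without replacement and keep only those rows.
n_keep    <- 4000
keep_idx  <- sample(nrow(toy), size = n_keep)
toy_small <- toy[keep_idx, ]

# Check that the class balance roughly survived the sampling.
prop.table(table(toy_small$radiant_win))
```

Because the rows are drawn at random, the share of Radiant wins in the small sample should stay close to the share in the full data, which is worth re-checking after any subsampling.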
I’ll combine some of the numeric data per team.
# r1-r5 stands for radiant player 1-5, and d1-d5 for dire player 1-5
train1.edit <- train1.edit %>%
mutate(
radiant_gold = as.integer(r1_gold + r2_gold + r3_gold + r4_gold + r5_gold),
radiant_xp = as.integer(r1_xp + r2_xp + r3_xp + r4_xp + r5_xp),
radiant_level = as.integer(r1_level + r2_level + r3_level + r4_level + r5_level),
radiant_lh = as.integer(r1_lh + r2_lh + r3_lh + r4_lh + r5_lh),
radiant_kills = as.integer(r1_kills + r2_kills + r3_kills + r4_kills + r5_kills),
radiant_deaths = as.integer(r1_deaths + r2_deaths + r3_deaths + r4_deaths + r5_deaths),
dire_gold = as.integer(d1_gold + d2_gold + d3_gold + d4_gold + d5_gold),
dire_xp = as.integer(d1_xp + d2_xp + d3_xp + d4_xp + d5_xp),
dire_level = as.integer(d1_level + d2_level + d3_level + d4_level + d5_level),
dire_lh = as.integer(d1_lh + d2_lh + d3_lh + d4_lh + d5_lh),
dire_kills = as.integer(d1_kills + d2_kills + d3_kills + d4_kills + d5_kills),
dire_deaths = as.integer(d1_deaths + d2_deaths + d3_deaths + d4_deaths + d5_deaths)
)
train1x <- train1.edit[,1:81] %>%
select_if(Negate(is.integer))
train1.new <- cbind(train1x, train1.edit[,82:115])
# +90 means i add 90 seconds to each time variable to avoid negative values in the data.
# In Dota 2, the clock actually starts at -1:30 for preparation.
# A negative time in the raw data means the event happened before the game started at 00:00
train1.new[,c(22,26,27,28,33,34,35,36,41,42)] <- train1.new[,c(22,26,27,28,33,34,35,36,41,42)] + 90
NA values in the time variables are kinda tricky. If you fill NA with 0, the model will assume the event occurred at second 0, and we don’t want that. So I convert all the time (in seconds) variables to categorical values. The additional level is meant for the NA values, indicating that a certain event did not happen.
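To make the idea concrete, here is a small sketch on made-up numbers: bin a time variable with `cut()`, then turn the leftover NA values into their own level. The vector `bottle_time` is invented, the breaks are the bottle-time breaks used in this post, and `addNA()` is a base R helper I use here instead of the `factor(x, exclude = NULL)` call in the notebook code:

```r
# Toy event times in seconds; NA means the event never happened
# within the first 5 minutes.
bottle_time <- c(120, 200, NA, 250, 400, NA)

# Bin into four ordered categories.
bottle_bin <- cut(
  bottle_time,
  breaks = c(-Inf, 166, 219, 257, Inf),
  labels = c("1", "2", "3", "4")
)

# cut() leaves NA as NA, so add NA as an explicit level and rename it "5".
bottle_bin <- addNA(bottle_bin)
levels(bottle_bin)[is.na(levels(bottle_bin))] <- "5"

table(bottle_bin)
```

This way a missing event is a category the model can learn from, instead of being mistaken for an event at second 0.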
labels.time <- c("1","2","3","4")
train1.new <- train1.new %>% mutate(
first.blood.time = cut(
train1.new$first_blood_time,
breaks = c(-Inf,103,171,232,Inf),
labels = labels.time),
radiant.bottle.time = cut(
train1.new$radiant_bottle_time,
breaks = c(-Inf, 166, 219, 257, Inf),
labels = labels.time),
radiant.courier.time = cut(
train1.new$radiant_courier_time,
breaks = c(-Inf, 7, 16, 27, Inf),
labels = labels.time),
radiant.fly.courier.time = cut(
train1.new$radiant_flying_courier_time,
breaks = c(-Inf, 278, 301, 311, Inf),
labels = labels.time),
radiant.first.ward.time = cut(
train1.new$radiant_first_ward_time,
breaks = c(-Inf, 68, 90, 150, Inf),
labels = labels.time),
dire.bottle.time = cut(
train1.new$dire_bottle_time,
breaks = c(-Inf, 166, 219, 257, Inf),
labels = labels.time),
dire.courier.time = cut(
train1.new$dire_courier_time,
breaks = c(-Inf, 7, 16, 27, Inf),
labels = labels.time),
dire.fly.courier.time = cut(
train1.new$dire_flying_courier_time,
breaks = c(-Inf, 278, 301, 311, Inf),
labels = labels.time),
dire.first.ward.time = cut(
train1.new$dire_first_ward_time,
breaks = c(-Inf, 68, 90, 150, Inf),
labels = labels.time)
)
# In Dota terms, game duration is usually split into 3 categories: fast game, normal game, and late game.
# I convert the game duration into 3 categories: fast for up to 42 minutes, normal for 42 ~ 49 minutes, and late for more than 49 minutes.
train1.new <- train1.new %>% mutate(
duration.c = cut(
train1.new$duration/60,
breaks = c(-Inf, 42, 49, Inf),
labels = c("fast","normal","late")
)
)
Let’s see if we correctly converted the numeric values to factors
# fill NA = 5 for the time variables. Level 5 means the team didn't do the 'thing'
# for example: if the first.blood.time level = 5,
# it means that match doesn't have any first blood in the first 5 minutes
train1.x[46:54] <- lapply(train1.x[46:54], function(x){
x <- factor(x, exclude = NULL)
levels(x)[is.na(levels(x))] <- "5"
return(x)
})
## for first.blood.team NA, i'll fill "3", meaning the match doesn't have any first blood in the first 5 minutes
levels(train1.x$first_blood_team) <- c(levels(train1.x$first_blood_team), "3")
train1.x$first_blood_team[which(is.na(train1.x$first_blood_team))] <- "3"
## and for first.blood.player NA, i'll fill "10", meaning the match doesn't have any first blood
train1.x[23:24] <- lapply(train1.x[23:24], function(x){
x <- factor(x, exclude = NULL)
levels(x)[is.na(levels(x))] <- "10"
return(x)
})
We know that the radiant/dire tp, boots, and ward counts have a different range from radiant/dire gold, xp, kills, etc., so we need to normalize them in order to re-scale them into the same range.
Now comes the hardest part (for me). I actually spent a lot of time figuring out how to represent heroes with fewer levels. There are 117 heroes in Dota and this whole dataset uses 115 of them. If we put in a factor with 115 levels, I’m pretty sure it would break the model. Good thing the competition provides hero details as JSON data. I convert every hero used per team per match to its roles.
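The normalization code is not shown in this post. The scaled values in the final glimpse (for example boots counts such as 0.1428571, 0.4285714, 0.7142857) are consistent with min-max scaling to [0, 1], so here is a minimal sketch under that assumption; the helper `normalize()` and the toy data frame are hypothetical names, not taken from the notebook:

```r
# Min-max scaling: map a numeric vector onto [0, 1].
# Assumes the vector is not constant (otherwise the denominator is 0).
normalize <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Toy data with two columns on very different scales.
toy <- data.frame(
  radiant_boots_count = c(0, 2, 3, 5, 7),
  radiant_gold        = c(3000, 5200, 6100, 7400, 9000)
)

# Apply the same transformation to every numeric column.
num_cols <- vapply(toy, is.numeric, logical(1))
toy[num_cols] <- lapply(toy[num_cols], normalize)

toy
```

After scaling, every numeric column runs from 0 to 1, so the size of a column's values no longer depends on its original unit.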
# select the hero variables in the main dataframe
hero.use <- train1.x[,grepl("_hero", names(train1.x))]
# convert list (from json) to dataframe
hero.df2 <- rbindlist(hero.name)
hero.df2$roles <- as.factor(hero.df2$roles)
# In dota2, hero roles are split into 9 categories: carry, disabler, durable, escape, initiator, jungler, nuker, pusher, and support.
# one hero can have many roles, for example `Anti Mage` (hero id `1`) has Carry, Escape and Nuker
# group the heroes by id, then combine all the roles of each hero into one string.
hero.df2x <- hero.df2 %>% group_by(id) %>%
summarise(roles.c = paste(roles, collapse = " "))
hero.df2x$id <- as.factor(hero.df2x$id)
# change all the id in hero.use to its matching roles
use.cate <- apply(hero.use, c(1,2), function(x) hero.df2x[hero.df2x$id == x, 2])
use.cate <- as.data.frame(matrix(unlist(use.cate),nrow = 4000), stringsAsFactors = F)
Then make new columns for the hero roles used by each team. For example: r.carry means the total number of heroes on the Radiant team that have a carry role (please keep in mind, one hero can have many roles)
# hero roles for radiant team per match
use.cate$r.carry <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
%in% "carry"))
use.cate$r.disabler <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
%in% "disabler"))
use.cate$r.durable <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
%in% "durable"))
use.cate$r.escape <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
%in% "escape"))
use.cate$r.initiator <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
%in% "initiator"))
use.cate$r.jungler <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
%in% "jungler"))
use.cate$r.nuker <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
%in% "nuker"))
use.cate$r.pusher <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
%in% "pusher"))
use.cate$r.support <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
%in% "support"))
# hero roles for dire team per match
use.cate$d.carry <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
%in% "carry"))
use.cate$d.disabler <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
%in% "disabler"))
use.cate$d.durable <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
%in% "durable"))
use.cate$d.escape <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
%in% "escape"))
use.cate$d.initiator <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
%in% "initiator"))
use.cate$d.jungler <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
%in% "jungler"))
use.cate$d.nuker <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
%in% "nuker"))
use.cate$d.pusher <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
%in% "pusher"))
use.cate$d.support <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
%in% "support"))
Let’s take a quick look at our hero data
## Observations: 4,000
## Variables: 28
## $ V1 <chr> "Carry Escape Nuker Disabler Initiator", "Carry Initiat...
## $ V2 <chr> "Carry Nuker", "Carry Initiator Durable Disabler Nuker"...
## $ V3 <chr> "Nuker Durable Escape", "Support Disabler Nuker", "Carr...
## $ V4 <chr> "Support Disabler Nuker", "Carry Nuker Pusher", "Carry ...
## $ V5 <chr> "Support Disabler Nuker Durable", "Disabler Initiator D...
## $ V6 <chr> "Carry Support Disabler Escape Nuker", "Support Disable...
## $ V7 <chr> "Carry Durable Initiator", "Initiator Jungler Escape Di...
## $ V8 <chr> "Initiator Disabler Nuker", "Carry Disabler Initiator D...
## $ V9 <chr> "Carry Initiator Disabler Durable Escape", "Carry Escap...
## $ V10 <chr> "Support Carry Durable", "Carry Nuker Pusher Initiator ...
## $ r.carry <int> 2, 3, 4, 3, 4, 2, 2, 3, 2, 3, 3, 3, 2, 4, 2, 4, 2, 2, 3...
## $ r.disabler <int> 3, 4, 3, 2, 4, 4, 4, 4, 5, 5, 3, 5, 5, 5, 3, 2, 5, 3, 4...
## $ r.durable <int> 2, 3, 2, 0, 2, 3, 1, 2, 2, 2, 1, 2, 2, 4, 0, 3, 2, 1, 1...
## $ r.escape <int> 2, 1, 2, 2, 3, 3, 3, 1, 1, 1, 4, 2, 0, 1, 3, 2, 1, 1, 3...
## $ r.initiator <int> 1, 3, 1, 0, 2, 2, 2, 2, 3, 1, 2, 2, 4, 4, 0, 3, 2, 3, 2...
## $ r.jungler <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 1, 0...
## $ r.nuker <int> 5, 4, 3, 3, 5, 3, 5, 5, 5, 5, 3, 2, 4, 3, 5, 3, 3, 3, 4...
## $ r.pusher <int> 0, 1, 3, 2, 2, 0, 2, 1, 1, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0...
## $ r.support <int> 2, 1, 1, 2, 0, 3, 2, 0, 2, 4, 2, 2, 3, 1, 1, 0, 2, 2, 3...
## $ d.carry <int> 4, 3, 1, 2, 2, 4, 2, 4, 3, 3, 1, 3, 4, 3, 4, 5, 5, 2, 3...
## $ d.disabler <int> 3, 5, 4, 4, 4, 3, 3, 4, 4, 2, 3, 2, 4, 4, 5, 5, 2, 4, 1...
## $ d.durable <int> 3, 2, 2, 3, 1, 1, 2, 2, 2, 2, 3, 2, 3, 2, 1, 2, 2, 2, 1...
## $ d.escape <int> 2, 2, 0, 4, 3, 2, 0, 3, 3, 0, 1, 2, 2, 1, 3, 2, 2, 1, 4...
## $ d.initiator <int> 3, 3, 2, 4, 3, 2, 2, 2, 1, 2, 2, 1, 2, 2, 0, 3, 1, 3, 1...
## $ d.jungler <int> 0, 1, 1, 0, 1, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0...
## $ d.nuker <int> 2, 3, 4, 4, 5, 4, 4, 2, 4, 3, 4, 4, 3, 4, 4, 4, 3, 3, 3...
## $ d.pusher <int> 0, 1, 2, 0, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0, 0, 2, 3, 1, 1...
## $ d.support <int> 2, 1, 2, 0, 3, 2, 2, 2, 1, 1, 1, 2, 1, 2, 3, 2, 0, 3, 1...
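The eighteen near-identical role-count assignments above can also be written as one loop. This is only a sketch on toy data: it uses base R `strsplit()` plus `tolower()` in place of `tokenize_words()` (which, as far as I know, lowercases its output by default, which is why matching against "carry" works in the original code), and the names `toy`, `roles`, and `count_role()` are hypothetical:

```r
# Toy stand-in for use.cate: one space-separated role string per hero slot
# (two Radiant slots and two Dire slots here instead of five each).
toy <- data.frame(
  V1 = c("Carry Escape Nuker", "Support Disabler"),
  V2 = c("Carry Nuker",        "Carry Durable"),
  V3 = c("Support Nuker",      "Nuker Pusher"),
  V4 = c("Initiator Disabler", "Support Nuker"),
  stringsAsFactors = FALSE
)

roles <- c("carry", "disabler", "durable", "escape", "initiator",
           "jungler", "nuker", "pusher", "support")

# Count how many heroes in the given slot columns list a given role.
count_role <- function(df, cols, role) {
  apply(df[, cols, drop = FALSE], 1, function(x) {
    words <- tolower(unlist(strsplit(x, " ", fixed = TRUE)))
    sum(words == role)
  })
}

# One loop adds an r.<role> and a d.<role> column for every role.
for (role in roles) {
  toy[[paste0("r.", role)]] <- count_role(toy, 1:2, role)
  toy[[paste0("d.", role)]] <- count_role(toy, 3:4, role)
}

toy[, c("r.carry", "r.nuker", "d.nuker", "d.support")]
```

With the real data the slot columns would be 1:5 for Radiant and 6:10 for Dire, as in the code above.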
# bind to main dataframe and drop unnecessary variables
use.cate <- use.cate %>% select(11:28)
train1.x <- cbind(train1.x,use.cate)
train1.x <- train1.x %>% select(-c(2,4,6,8,10,12,14,16,18,20))
Note: Items are one of the most useful components in Dota, but in this case I’m going to exclude them from modelling since they have so many levels per column, and I’m afraid that would hurt the model. Until I can solve that problem, I’ll treat items as if they were not part of the game.
Finally, we can re-check our preprocessed data before modelling
## Observations: 4,000
## Variables: 53
## $ match_id <fct> 2000270094, 1964538474, 1962513300, 198...
## $ first_blood_team <fct> 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 3, ...
## $ first_blood_player1 <fct> 4, 2, 7, 1, 0, 5, 0, 2, 5, 1, 2, 1, 10,...
## $ first_blood_player2 <fct> 7, 6, 0, 6, 7, 4, 6, 8, 2, 5, 7, 5, 10,...
## $ radiant_tpscroll_count <dbl> 0.23076923, 0.15384615, 0.15384615, 0.3...
## $ radiant_boots_count <dbl> 0.7142857, 0.4285714, 0.1428571, 0.7142...
## $ radiant_ward_observer_count <dbl> 0.1666667, 0.3333333, 0.3333333, 0.3333...
## $ radiant_ward_sentry_count <dbl> 0.125, 0.125, 0.000, 0.125, 0.000, 0.12...
## $ dire_tpscroll_count <dbl> 0.41666667, 0.25000000, 0.16666667, 0.3...
## $ dire_boots_count <dbl> 0.2857143, 0.4285714, 0.5714286, 0.5714...
## $ dire_ward_observer_count <dbl> 0.5000000, 0.0000000, 0.1666667, 0.3333...
## $ dire_ward_sentry_count <dbl> 0.00000000, 0.00000000, 0.00000000, 0.0...
## $ radiant_win <fct> False, False, False, True, True, True, ...
## $ radiant_gold <dbl> 0.6481314, 0.6357171, 0.3212207, 0.6487...
## $ radiant_xp <dbl> 0.5800796, 0.7555334, 0.5527232, 0.7387...
## $ radiant_level <dbl> 0.00000000, 0.00000000, 0.26666667, 0.6...
## $ radiant_lh <dbl> 0.4141414, 0.5858586, 0.1818182, 0.6161...
## $ radiant_kills <dbl> 0.36363636, 0.18181818, 0.00000000, 0.1...
## $ radiant_deaths <dbl> 0.3846154, 0.4285714, 0.4835165, 0.7142...
## $ dire_gold <dbl> 0.4851485, 0.4633663, 0.4905941, 0.5230...
## $ dire_xp <dbl> 0.4819911, 0.6572933, 0.6180448, 0.6528...
## $ dire_level <dbl> 0.0000000, 0.0000000, 0.3333333, 0.4666...
## $ dire_lh <dbl> 0.4411765, 0.4705882, 0.3529412, 0.3725...
## $ dire_kills <dbl> 0.00000000, 0.00000000, 0.08333333, 0.0...
## $ dire_deaths <dbl> 0.5402299, 0.3563218, 0.4482759, 0.5517...
## $ first.blood.time <fct> 1, 1, 4, 1, 2, 3, 3, 2, 3, 2, 3, 4, 5, ...
## $ radiant.bottle.time <fct> 2, 1, 2, 2, 1, 1, 5, 5, 1, 3, 3, 1, 1, ...
## $ radiant.courier.time <fct> 1, 3, 2, 3, 3, 3, 1, 4, 2, 1, 1, 4, 1, ...
## $ radiant.fly.courier.time <fct> 4, 1, 4, 2, 5, 5, 3, 5, 5, 4, 2, 4, 1, ...
## $ radiant.first.ward.time <fct> 3, 2, 1, 3, 1, 2, 1, 5, 2, 1, 2, 5, 2, ...
## $ dire.bottle.time <fct> 5, 4, 3, 4, 4, 5, 3, 2, 5, 2, 1, 5, 2, ...
## $ dire.courier.time <fct> 1, 4, 1, 4, 4, 3, 1, 3, 1, 1, 1, 3, 1, ...
## $ dire.fly.courier.time <fct> 4, 5, 4, 5, 5, 1, 4, 5, 3, 5, 4, 5, 1, ...
## $ dire.first.ward.time <fct> 4, 5, 1, 3, 3, 3, 1, 2, 1, 1, 1, 3, 2, ...
## $ duration.c <fct> late, late, late, late, normal, fast, f...
## $ r.carry <int> 2, 3, 4, 3, 4, 2, 2, 3, 2, 3, 3, 3, 2, ...
## $ r.disabler <int> 3, 4, 3, 2, 4, 4, 4, 4, 5, 5, 3, 5, 5, ...
## $ r.durable <int> 2, 3, 2, 0, 2, 3, 1, 2, 2, 2, 1, 2, 2, ...
## $ r.escape <int> 2, 1, 2, 2, 3, 3, 3, 1, 1, 1, 4, 2, 0, ...
## $ r.initiator <int> 1, 3, 1, 0, 2, 2, 2, 2, 3, 1, 2, 2, 4, ...
## $ r.jungler <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ r.nuker <int> 5, 4, 3, 3, 5, 3, 5, 5, 5, 5, 3, 2, 4, ...
## $ r.pusher <int> 0, 1, 3, 2, 2, 0, 2, 1, 1, 0, 0, 2, 0, ...
## $ r.support <int> 2, 1, 1, 2, 0, 3, 2, 0, 2, 4, 2, 2, 3, ...
## $ d.carry <int> 4, 3, 1, 2, 2, 4, 2, 4, 3, 3, 1, 3, 4, ...
## $ d.disabler <int> 3, 5, 4, 4, 4, 3, 3, 4, 4, 2, 3, 2, 4, ...
## $ d.durable <int> 3, 2, 2, 3, 1, 1, 2, 2, 2, 2, 3, 2, 3, ...
## $ d.escape <int> 2, 2, 0, 4, 3, 2, 0, 3, 3, 0, 1, 2, 2, ...
## $ d.initiator <int> 3, 3, 2, 4, 3, 2, 2, 2, 1, 2, 2, 1, 2, ...
## $ d.jungler <int> 0, 1, 1, 0, 1, 1, 0, 2, 0, 0, 0, 1, 0, ...
## $ d.nuker <int> 2, 3, 4, 4, 5, 4, 4, 2, 4, 3, 4, 4, 3, ...
## $ d.pusher <int> 0, 1, 2, 0, 1, 2, 0, 1, 1, 0, 1, 1, 0, ...
## $ d.support <int> 2, 1, 2, 0, 3, 2, 2, 2, 1, 1, 1, 2, 1, ...
## match_id first_blood_team
## 0 0
## first_blood_player1 first_blood_player2
## 0 0
## radiant_tpscroll_count radiant_boots_count
## 0 0
## radiant_ward_observer_count radiant_ward_sentry_count
## 0 0
## dire_tpscroll_count dire_boots_count
## 0 0
## dire_ward_observer_count dire_ward_sentry_count
## 0 0
## radiant_win radiant_gold
## 0 0
## radiant_xp radiant_level
## 0 0
## radiant_lh radiant_kills
## 0 0
## radiant_deaths dire_gold
## 0 0
## dire_xp dire_level
## 0 0
## dire_lh dire_kills
## 0 0
## dire_deaths first.blood.time
## 0 0
## radiant.bottle.time radiant.courier.time
## 0 0
## radiant.fly.courier.time radiant.first.ward.time
## 0 0
## dire.bottle.time dire.courier.time
## 0 0
## dire.fly.courier.time dire.first.ward.time
## 0 0
## duration.c r.carry
## 0 0
## r.disabler r.durable
## 0 0
## r.escape r.initiator
## 0 0
## r.jungler r.nuker
## 0 0
## r.pusher r.support
## 0 0
## d.carry d.disabler
## 0 0
## d.durable d.escape
## 0 0
## d.initiator d.jungler
## 0 0
## d.nuker d.pusher
## 0 0
## d.support
## 0
I think it’s clean enough. Let’s start modelling.
Rikimaru image by: chroneco
# drop match id, we don't need it for modelling
train1.clean <- train1.clean[,-1]
# split data to train and test for model evaluation
splitted <- initial_split(train1.clean, prop = 0.8, strata = "radiant_win")
trainer1 <- training(splitted)
tester1 <- testing(splitted)
# let's re-check our target variable and hope it's properly distributed
table(trainer1$radiant_win) %>% prop.table()
##
## False True
## 0.4850094 0.5149906
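The model summary printed next comes from a logistic regression (see the `glm(...)` call in its header). For readers who want to see the whole fit-and-predict loop in one place, here is a minimal sketch on simulated data. Everything in it (the toy columns, the seed, the 0.5 cutoff, the base R split instead of `initial_split()`) is an assumption for illustration, not the notebook's actual code:

```r
# Toy data: the team with more deaths tends to lose.
set.seed(1)  # hypothetical seed for reproducibility
n <- 500
toy <- data.frame(
  radiant_deaths = runif(n),
  dire_deaths    = runif(n)
)
p <- plogis(4 * (toy$dire_deaths - toy$radiant_deaths))
toy$radiant_win <- factor(ifelse(runif(n) < p, "True", "False"))

# Hold out 20% of the rows for evaluation.
test_idx <- sample(n, size = n / 5)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Logistic regression on all remaining columns.
# With a factor response, glm() models the probability of the second level ("True").
fit <- glm(radiant_win ~ ., family = "binomial", data = train)

# Predicted probability that Radiant wins, then a 0.5 cutoff.
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, "True", "False")
mean(pred == test$radiant_win)  # hold-out accuracy
```

In the toy fit, the coefficient for `radiant_deaths` comes out negative and the one for `dire_deaths` positive, the same direction as the two strongest coefficients in the real summary.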
##
## Call:
## glm(formula = radiant_win ~ ., family = "binomial", data = trainer1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0755 -0.3244 0.0365 0.3262 3.5204
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.718816 1.471502 0.488 0.625202
## first_blood_team1 -0.046612 0.423357 -0.110 0.912330
## first_blood_team3 -0.442272 0.428499 -1.032 0.302005
## first_blood_player11 0.149778 0.292705 0.512 0.608858
## first_blood_player12 -0.004468 0.310510 -0.014 0.988519
## first_blood_player13 0.203825 0.301222 0.677 0.498622
## first_blood_player14 0.122737 0.294077 0.417 0.676412
## first_blood_player15 -0.208040 0.280786 -0.741 0.458743
## first_blood_player16 -0.332541 0.297536 -1.118 0.263717
## first_blood_player17 0.122867 0.277578 0.443 0.658027
## first_blood_player18 0.466778 0.279362 1.671 0.094747
## first_blood_player19 NA NA NA NA
## first_blood_player110 NA NA NA NA
## first_blood_player21 0.147292 0.303031 0.486 0.626922
## first_blood_player22 0.270640 0.293668 0.922 0.356744
## first_blood_player23 0.233564 0.296682 0.787 0.431133
## first_blood_player24 0.364407 0.299440 1.217 0.223619
## first_blood_player25 -0.205978 0.307756 -0.669 0.503311
## first_blood_player26 -0.350375 0.307651 -1.139 0.254756
## first_blood_player27 -0.181992 0.297735 -0.611 0.541032
## first_blood_player28 -0.176838 0.313295 -0.564 0.572451
## first_blood_player29 NA NA NA NA
## first_blood_player210 NA NA NA NA
## radiant_tpscroll_count -0.518914 0.608382 -0.853 0.393691
## radiant_boots_count -0.282910 0.429236 -0.659 0.509831
## radiant_ward_observer_count -0.177913 0.514200 -0.346 0.729344
## radiant_ward_sentry_count 0.905908 0.889720 1.018 0.308585
## dire_tpscroll_count 0.431231 0.556386 0.775 0.438305
## dire_boots_count -0.103685 0.428183 -0.242 0.808662
## dire_ward_observer_count 0.443668 0.538877 0.823 0.410326
## dire_ward_sentry_count -2.156600 1.560939 -1.382 0.167093
## radiant_gold 5.683899 1.556447 3.652 0.000260
## radiant_xp -1.431557 1.296344 -1.104 0.269462
## radiant_level -0.487327 0.937268 -0.520 0.603102
## radiant_lh 0.342452 1.009183 0.339 0.734357
## radiant_kills -4.284788 0.956614 -4.479 0.00000750
## radiant_deaths -18.312482 0.733115 -24.979 < 0.0000000000000002
## dire_gold -6.617641 1.762258 -3.755 0.000173
## dire_xp -0.034236 1.338015 -0.026 0.979587
## dire_level 0.436975 0.927391 0.471 0.637507
## dire_lh 1.351413 1.064070 1.270 0.204070
## dire_kills 5.272674 1.108351 4.757 0.00000196
## dire_deaths 18.829450 0.752749 25.014 < 0.0000000000000002
## first.blood.time2 -0.228811 0.189707 -1.206 0.227768
## first.blood.time3 -0.193356 0.192778 -1.003 0.315863
## first.blood.time4 -0.269448 0.211740 -1.273 0.203181
## first.blood.time5 NA NA NA NA
## radiant.bottle.time2 -0.028713 0.208533 -0.138 0.890485
## radiant.bottle.time3 -0.161015 0.213359 -0.755 0.450448
## radiant.bottle.time4 -0.235664 0.212732 -1.108 0.267950
## radiant.bottle.time5 0.045880 0.205230 0.224 0.823105
## radiant.courier.time2 -0.122975 0.181937 -0.676 0.499090
## radiant.courier.time3 -0.313223 0.216833 -1.445 0.148588
## radiant.courier.time4 -0.279884 0.227065 -1.233 0.217719
## radiant.courier.time5 -1.168399 0.606636 -1.926 0.054101
## radiant.fly.courier.time2 0.163821 0.251895 0.650 0.515463
## radiant.fly.courier.time3 -0.050559 0.360368 -0.140 0.888424
## radiant.fly.courier.time4 -0.036358 0.219201 -0.166 0.868261
## radiant.fly.courier.time5 -0.039463 0.209061 -0.189 0.850277
## radiant.first.ward.time2 0.426156 0.202550 2.104 0.035382
## radiant.first.ward.time3 0.124885 0.182774 0.683 0.494431
## radiant.first.ward.time4 0.031600 0.261804 0.121 0.903929
## radiant.first.ward.time5 0.100577 0.307240 0.327 0.743398
## dire.bottle.time2 0.328931 0.204196 1.611 0.107211
## dire.bottle.time3 0.247075 0.209482 1.179 0.238217
## dire.bottle.time4 0.032483 0.207279 0.157 0.875471
## dire.bottle.time5 -0.427453 0.202618 -2.110 0.034889
## dire.courier.time2 0.350168 0.187806 1.865 0.062248
## dire.courier.time3 0.419507 0.212843 1.971 0.048728
## dire.courier.time4 0.278563 0.227485 1.225 0.220751
## dire.courier.time5 0.774415 0.591901 1.308 0.190754
## dire.fly.courier.time2 0.250124 0.243268 1.028 0.303865
## dire.fly.courier.time3 0.625430 0.336807 1.857 0.063320
## dire.fly.courier.time4 0.228436 0.213497 1.070 0.284630
## dire.fly.courier.time5 0.592647 0.198576 2.984 0.002841
## dire.first.ward.time2 -0.231559 0.196908 -1.176 0.239606
## dire.first.ward.time3 -0.398541 0.190022 -2.097 0.035964
## dire.first.ward.time4 -0.110378 0.250457 -0.441 0.659425
## dire.first.ward.time5 -0.202947 0.298310 -0.680 0.496301
## duration.cnormal -0.049248 0.174949 -0.282 0.778326
## duration.clate -0.349623 0.180878 -1.933 0.053246
## r.carry 0.272834 0.083140 3.282 0.001032
## r.disabler -0.161057 0.086884 -1.854 0.063781
## r.durable -0.013771 0.078831 -0.175 0.861321
## r.escape -0.130976 0.070386 -1.861 0.062769
## r.initiator 0.022527 0.078665 0.286 0.774600
## r.jungler -0.035087 0.112016 -0.313 0.754105
## r.nuker -0.112139 0.080619 -1.391 0.164236
## r.pusher 0.178244 0.079508 2.242 0.024972
## r.support 0.191040 0.084846 2.252 0.024347
## d.carry -0.023684 0.084487 -0.280 0.779232
## d.disabler 0.165348 0.080708 2.049 0.040490
## d.durable -0.146797 0.076943 -1.908 0.056409
## d.escape 0.072254 0.068421 1.056 0.290958
## d.initiator -0.118486 0.076963 -1.540 0.123678
## d.jungler -0.090622 0.112680 -0.804 0.421257
## d.nuker 0.064229 0.081308 0.790 0.429564
## d.pusher -0.073671 0.081768 -0.901 0.367602
## d.support -0.229808 0.082033 -2.801 0.005088
##
## (Intercept)
## first_blood_team1
## first_blood_team3
## first_blood_player11
## first_blood_player12
## first_blood_player13
## first_blood_player14
## first_blood_player15
## first_blood_player16
## first_blood_player17
## first_blood_player18 .
## first_blood_player19
## first_blood_player110
## first_blood_player21
## first_blood_player22
## first_blood_player23
## first_blood_player24
## first_blood_player25
## first_blood_player26
## first_blood_player27
## first_blood_player28
## first_blood_player29
## first_blood_player210
## radiant_tpscroll_count
## radiant_boots_count
## radiant_ward_observer_count
## radiant_ward_sentry_count
## dire_tpscroll_count
## dire_boots_count
## dire_ward_observer_count
## dire_ward_sentry_count
## radiant_gold ***
## radiant_xp
## radiant_level
## radiant_lh
## radiant_kills ***
## radiant_deaths ***
## dire_gold ***
## dire_xp
## dire_level
## dire_lh
## dire_kills ***
## dire_deaths ***
## first.blood.time2
## first.blood.time3
## first.blood.time4
## first.blood.time5
## radiant.bottle.time2
## radiant.bottle.time3
## radiant.bottle.time4
## radiant.bottle.time5
## radiant.courier.time2
## radiant.courier.time3
## radiant.courier.time4
## radiant.courier.time5 .
## radiant.fly.courier.time2
## radiant.fly.courier.time3
## radiant.fly.courier.time4
## radiant.fly.courier.time5
## radiant.first.ward.time2 *
## radiant.first.ward.time3
## radiant.first.ward.time4
## radiant.first.ward.time5
## dire.bottle.time2
## dire.bottle.time3
## dire.bottle.time4
## dire.bottle.time5 *
## dire.courier.time2 .
## dire.courier.time3 *
## dire.courier.time4
## dire.courier.time5
## dire.fly.courier.time2
## dire.fly.courier.time3 .
## dire.fly.courier.time4
## dire.fly.courier.time5 **
## dire.first.ward.time2
## dire.first.ward.time3 *
## dire.first.ward.time4
## dire.first.ward.time5
## duration.cnormal
## duration.clate .
## r.carry **
## r.disabler .
## r.durable
## r.escape .
## r.initiator
## r.jungler
## r.nuker
## r.pusher *
## r.support *
## d.carry
## d.disabler *
## d.durable .
## d.escape
## d.initiator
## d.jungler
## d.nuker
## d.pusher
## d.support **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4436.0 on 3201 degrees of freedom
## Residual deviance: 1707.8 on 3108 degrees of freedom
## AIC: 1895.8
##
## Number of Fisher Scoring iterations: 6
From the summary above, we can see that 5 coefficients are not defined because of singularities. This happens when two or more of the predictors are perfectly collinear. To identify which variables are collinear, we can call the alias() function on our model.
## (Intercept) first_blood_team1 first_blood_team3
## first_blood_player19 0 1 0
## first_blood_player110 0 0 1
## first_blood_player29 1 -1 -1
## first_blood_player210 0 0 1
## first.blood.time5 0 0 1
## first_blood_player11 first_blood_player12
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## first_blood_player13 first_blood_player14
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## first_blood_player15 first_blood_player16
## first_blood_player19 -1 -1
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## first_blood_player17 first_blood_player18
## first_blood_player19 -1 -1
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## first_blood_player21 first_blood_player22
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## first_blood_player23 first_blood_player24
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## first_blood_player25 first_blood_player26
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 -1 -1
## first_blood_player210 0 0
## first.blood.time5 0 0
## first_blood_player27 first_blood_player28
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 -1 -1
## first_blood_player210 0 0
## first.blood.time5 0 0
## radiant_tpscroll_count radiant_boots_count
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## radiant_ward_observer_count radiant_ward_sentry_count
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## dire_tpscroll_count dire_boots_count
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## dire_ward_observer_count dire_ward_sentry_count
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## radiant_gold radiant_xp radiant_level radiant_lh
## first_blood_player19 0 0 0 0
## first_blood_player110 0 0 0 0
## first_blood_player29 0 0 0 0
## first_blood_player210 0 0 0 0
## first.blood.time5 0 0 0 0
## radiant_kills radiant_deaths dire_gold dire_xp dire_level
## first_blood_player19 0 0 0 0 0
## first_blood_player110 0 0 0 0 0
## first_blood_player29 0 0 0 0 0
## first_blood_player210 0 0 0 0 0
## first.blood.time5 0 0 0 0 0
## dire_lh dire_kills dire_deaths first.blood.time2
## first_blood_player19 0 0 0 0
## first_blood_player110 0 0 0 0
## first_blood_player29 0 0 0 0
## first_blood_player210 0 0 0 0
## first.blood.time5 0 0 0 0
## first.blood.time3 first.blood.time4 radiant.bottle.time2
## first_blood_player19 0 0 0
## first_blood_player110 0 0 0
## first_blood_player29 0 0 0
## first_blood_player210 0 0 0
## first.blood.time5 0 0 0
## radiant.bottle.time3 radiant.bottle.time4
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## radiant.bottle.time5 radiant.courier.time2
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## radiant.courier.time3 radiant.courier.time4
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## radiant.courier.time5 radiant.fly.courier.time2
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## radiant.fly.courier.time3 radiant.fly.courier.time4
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## radiant.fly.courier.time5 radiant.first.ward.time2
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## radiant.first.ward.time3 radiant.first.ward.time4
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## radiant.first.ward.time5 dire.bottle.time2
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## dire.bottle.time3 dire.bottle.time4 dire.bottle.time5
## first_blood_player19 0 0 0
## first_blood_player110 0 0 0
## first_blood_player29 0 0 0
## first_blood_player210 0 0 0
## first.blood.time5 0 0 0
## dire.courier.time2 dire.courier.time3 dire.courier.time4
## first_blood_player19 0 0 0
## first_blood_player110 0 0 0
## first_blood_player29 0 0 0
## first_blood_player210 0 0 0
## first.blood.time5 0 0 0
## dire.courier.time5 dire.fly.courier.time2
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## dire.fly.courier.time3 dire.fly.courier.time4
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## dire.fly.courier.time5 dire.first.ward.time2
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## dire.first.ward.time3 dire.first.ward.time4
## first_blood_player19 0 0
## first_blood_player110 0 0
## first_blood_player29 0 0
## first_blood_player210 0 0
## first.blood.time5 0 0
## dire.first.ward.time5 duration.cnormal duration.clate
## first_blood_player19 0 0 0
## first_blood_player110 0 0 0
## first_blood_player29 0 0 0
## first_blood_player210 0 0 0
## first.blood.time5 0 0 0
## r.carry r.disabler r.durable r.escape r.initiator
## first_blood_player19 0 0 0 0 0
## first_blood_player110 0 0 0 0 0
## first_blood_player29 0 0 0 0 0
## first_blood_player210 0 0 0 0 0
## first.blood.time5 0 0 0 0 0
## r.jungler r.nuker r.pusher r.support d.carry d.disabler
## first_blood_player19 0 0 0 0 0 0
## first_blood_player110 0 0 0 0 0 0
## first_blood_player29 0 0 0 0 0 0
## first_blood_player210 0 0 0 0 0 0
## first.blood.time5 0 0 0 0 0 0
## d.durable d.escape d.initiator d.jungler d.nuker d.pusher
## first_blood_player19 0 0 0 0 0 0
## first_blood_player110 0 0 0 0 0 0
## first_blood_player29 0 0 0 0 0 0
## first_blood_player210 0 0 0 0 0 0
## first.blood.time5 0 0 0 0 0 0
## d.support
## first_blood_player19 0
## first_blood_player110 0
## first_blood_player29 0
## first_blood_player210 0
## first.blood.time5 0
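In the alias output, each row is one of the undefined coefficients, and every nonzero entry (1 or -1) marks a coefficient it depends on. first_blood_player110, first_blood_player210, and first.blood.time5 are each identical to first_blood_team3, while first_blood_player19 and first_blood_player29 are exact linear combinations of the first_blood_team dummies and the other player dummies. These are exact linear dependencies, not merely high correlations. We should remove the first_blood_player1, first_blood_player2, and first.blood.time variables, which cause the multicollinearity, and rebuild the model.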
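To make the idea concrete, here is a minimal sketch on made-up data, not the article's dataset. The columns `team`, `gold`, `no.blood`, and `win` are hypothetical; `no.blood` is built to duplicate one of the `team` dummies, so the model hits the same kind of singularity.

```r
# Toy data: all column names here are illustrative assumptions
set.seed(1)
n <- 200
toy <- data.frame(
  team = factor(sample(c("radiant", "dire", "none"), n, replace = TRUE)),
  gold = rnorm(n)
)
# no.blood is 1 exactly when team == "none", so it duplicates the dummy 'teamnone'
toy$no.blood <- as.integer(toy$team == "none")
toy$win <- rbinom(n, 1, plogis(toy$gold))

fit <- glm(win ~ team + gold + no.blood, family = "binomial", data = toy)

# the redundant coefficient is reported as NA ("not defined because of singularities")
coef(fit)

# alias() lists each undefined coefficient and the terms it depends on
dep <- alias(fit)$Complete
dep
```

In this toy case the row for `no.blood` should show a 1 under `teamnone`, which is the same pattern as the 1/-1 entries in the matrix above.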
trainer1 <- trainer1 %>% select(-c("first_blood_player1","first_blood_player2","first.blood.time"))
tester1 <- tester1 %>% select(-c("first_blood_player1","first_blood_player2","first.blood.time"))
# re-build the model
glm.modx <- glm(radiant_win~., family = "binomial", data=trainer1)
summary(glm.modx)
##
## Call:
## glm(formula = radiant_win ~ ., family = "binomial", data = trainer1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1239 -0.3339 0.0384 0.3360 3.5139
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.573158 1.422340 0.403 0.686972
## first_blood_team1 0.250230 0.160030 1.564 0.117901
## first_blood_team3 -0.148207 0.280242 -0.529 0.596906
## radiant_tpscroll_count -0.555844 0.602067 -0.923 0.355890
## radiant_boots_count -0.291132 0.426301 -0.683 0.494654
## radiant_ward_observer_count -0.199725 0.507097 -0.394 0.693685
## radiant_ward_sentry_count 0.999984 0.883201 1.132 0.257539
## dire_tpscroll_count 0.381178 0.551612 0.691 0.489549
## dire_boots_count -0.079117 0.423894 -0.187 0.851940
## dire_ward_observer_count 0.389431 0.532573 0.731 0.464641
## dire_ward_sentry_count -2.292046 1.554027 -1.475 0.140237
## radiant_gold 5.406592 1.540102 3.511 0.000447
## radiant_xp -1.527721 1.272678 -1.200 0.229984
## radiant_level -0.357128 0.931393 -0.383 0.701398
## radiant_lh 0.493833 0.993591 0.497 0.619176
## radiant_kills -3.957127 0.931430 -4.248 0.000021526
## radiant_deaths -18.206548 0.725634 -25.091 < 0.0000000000000002
## dire_gold -6.553445 1.747797 -3.750 0.000177
## dire_xp -0.009272 1.320130 -0.007 0.994396
## dire_level 0.308288 0.919844 0.335 0.737510
## dire_lh 1.342968 1.053349 1.275 0.202327
## dire_kills 5.575367 1.075761 5.183 0.000000219
## dire_deaths 18.714164 0.743607 25.167 < 0.0000000000000002
## radiant.bottle.time2 -0.030721 0.206415 -0.149 0.881685
## radiant.bottle.time3 -0.154340 0.210646 -0.733 0.463743
## radiant.bottle.time4 -0.231778 0.210194 -1.103 0.270164
## radiant.bottle.time5 0.034875 0.202606 0.172 0.863335
## radiant.courier.time2 -0.137913 0.179984 -0.766 0.443525
## radiant.courier.time3 -0.325337 0.214871 -1.514 0.130000
## radiant.courier.time4 -0.281321 0.224270 -1.254 0.209702
## radiant.courier.time5 -1.216532 0.605217 -2.010 0.044423
## radiant.fly.courier.time2 0.155010 0.249647 0.621 0.534655
## radiant.fly.courier.time3 -0.080350 0.358810 -0.224 0.822808
## radiant.fly.courier.time4 -0.044187 0.217117 -0.204 0.838733
## radiant.fly.courier.time5 -0.065200 0.205968 -0.317 0.751582
## radiant.first.ward.time2 0.447264 0.200334 2.233 0.025576
## radiant.first.ward.time3 0.160904 0.180941 0.889 0.373862
## radiant.first.ward.time4 0.076372 0.259180 0.295 0.768249
## radiant.first.ward.time5 0.118952 0.303787 0.392 0.695381
## dire.bottle.time2 0.335997 0.202401 1.660 0.096902
## dire.bottle.time3 0.238087 0.207832 1.146 0.251972
## dire.bottle.time4 0.052673 0.205312 0.257 0.797524
## dire.bottle.time5 -0.370267 0.200481 -1.847 0.064762
## dire.courier.time2 0.344309 0.185793 1.853 0.063856
## dire.courier.time3 0.399966 0.210840 1.897 0.057826
## dire.courier.time4 0.256241 0.224465 1.142 0.253637
## dire.courier.time5 0.680757 0.587337 1.159 0.246433
## dire.fly.courier.time2 0.219441 0.241159 0.910 0.362853
## dire.fly.courier.time3 0.602158 0.331677 1.815 0.069448
## dire.fly.courier.time4 0.211294 0.210944 1.002 0.316508
## dire.fly.courier.time5 0.582036 0.196383 2.964 0.003039
## dire.first.ward.time2 -0.233495 0.195112 -1.197 0.231414
## dire.first.ward.time3 -0.367372 0.187683 -1.957 0.050299
## dire.first.ward.time4 -0.073471 0.247524 -0.297 0.766600
## dire.first.ward.time5 -0.200589 0.295729 -0.678 0.497589
## duration.cnormal -0.050582 0.173247 -0.292 0.770315
## duration.clate -0.330714 0.179374 -1.844 0.065225
## r.carry 0.277567 0.082062 3.382 0.000718
## r.disabler -0.156464 0.086069 -1.818 0.069082
## r.durable -0.024560 0.078114 -0.314 0.753205
## r.escape -0.129086 0.069771 -1.850 0.064294
## r.initiator 0.016762 0.077776 0.216 0.829366
## r.jungler -0.033309 0.111380 -0.299 0.764897
## r.nuker -0.125011 0.079819 -1.566 0.117306
## r.pusher 0.168854 0.078664 2.147 0.031831
## r.support 0.195588 0.083758 2.335 0.019535
## d.carry -0.023203 0.083594 -0.278 0.781342
## d.disabler 0.159521 0.080342 1.986 0.047085
## d.durable -0.147526 0.076180 -1.937 0.052800
## d.escape 0.067668 0.067785 0.998 0.318150
## d.initiator -0.116097 0.076120 -1.525 0.127214
## d.jungler -0.080224 0.111269 -0.721 0.470915
## d.nuker 0.054818 0.080375 0.682 0.495227
## d.pusher -0.085410 0.081397 -1.049 0.294040
## d.support -0.212370 0.081087 -2.619 0.008818
##
## (Intercept)
## first_blood_team1
## first_blood_team3
## radiant_tpscroll_count
## radiant_boots_count
## radiant_ward_observer_count
## radiant_ward_sentry_count
## dire_tpscroll_count
## dire_boots_count
## dire_ward_observer_count
## dire_ward_sentry_count
## radiant_gold ***
## radiant_xp
## radiant_level
## radiant_lh
## radiant_kills ***
## radiant_deaths ***
## dire_gold ***
## dire_xp
## dire_level
## dire_lh
## dire_kills ***
## dire_deaths ***
## radiant.bottle.time2
## radiant.bottle.time3
## radiant.bottle.time4
## radiant.bottle.time5
## radiant.courier.time2
## radiant.courier.time3
## radiant.courier.time4
## radiant.courier.time5 *
## radiant.fly.courier.time2
## radiant.fly.courier.time3
## radiant.fly.courier.time4
## radiant.fly.courier.time5
## radiant.first.ward.time2 *
## radiant.first.ward.time3
## radiant.first.ward.time4
## radiant.first.ward.time5
## dire.bottle.time2 .
## dire.bottle.time3
## dire.bottle.time4
## dire.bottle.time5 .
## dire.courier.time2 .
## dire.courier.time3 .
## dire.courier.time4
## dire.courier.time5
## dire.fly.courier.time2
## dire.fly.courier.time3 .
## dire.fly.courier.time4
## dire.fly.courier.time5 **
## dire.first.ward.time2
## dire.first.ward.time3 .
## dire.first.ward.time4
## dire.first.ward.time5
## duration.cnormal
## duration.clate .
## r.carry ***
## r.disabler .
## r.durable
## r.escape .
## r.initiator
## r.jungler
## r.nuker
## r.pusher *
## r.support *
## d.carry
## d.disabler *
## d.durable .
## d.escape
## d.initiator
## d.jungler
## d.nuker
## d.pusher
## d.support **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4436.0 on 3201 degrees of freedom
## Residual deviance: 1721.8 on 3127 degrees of freedom
## AIC: 1871.8
##
## Number of Fisher Scoring iterations: 6
The model has no singularity problem this time, so let's move on to prediction.
# predict on the test data
tester1$predicted <- predict(glm.modx, type = "response", newdata = tester1)
# use the default threshold of 0.5
tester1$pred.lab <- ifelse(tester1$predicted > 0.5, "True", "False")
# data.frame() keeps the probabilities numeric (cbind() would coerce them to character)
predicted.df <- data.frame("pred.glm1" = tester1$predicted,
                           "label.glm1" = as.factor(tester1$pred.lab))
# drop the new columns from tester
tester1 <- tester1 %>% select(1:49)
# confusion matrix (confusionMatrix() expects factors)
glm.ev1 <- confusionMatrix(predicted.df$label.glm1, reference = tester1$radiant_win,
                           positive = "True")
glm.ev1
## Confusion Matrix and Statistics
##
## Reference
## Prediction False True
## False 350 43
## True 37 368
##
## Accuracy : 0.8997
## 95% CI : (0.8768, 0.9197)
## No Information Rate : 0.515
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.7994
##
## Mcnemar's Test P-Value : 0.5762
##
## Sensitivity : 0.8954
## Specificity : 0.9044
## Pos Pred Value : 0.9086
## Neg Pred Value : 0.8906
## Prevalence : 0.5150
## Detection Rate : 0.4612
## Detection Prevalence : 0.5075
## Balanced Accuracy : 0.8999
##
## 'Positive' Class : True
##
First, we need to separate the predictors from the label in both the train and test data.
trainer1.x <- trainer1[,-10]
tester1.x <- tester1[,-c(10)]
trainer1.y <- trainer1[,10]
tester1.y <- tester1[,10]
Note: the KNN model only accepts numeric data. We could convert every factor into a set of booleans (so-called dummy variables), but that is a lot of work and we would end up with many more variables, so I'll drop all non-numeric variables instead.
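The step of keeping only the numeric columns can be sketched in base R as follows. This is an illustration on a made-up frame, not the article's own code; the column values are invented and `toy.num` is a hypothetical name.

```r
# Toy frame with mixed column types (values are made up for illustration)
toy <- data.frame(
  radiant_gold  = c(0.41, 0.55, 0.38),
  radiant_kills = c(0.10, 0.30, 0.20),
  duration.c    = factor(c("early", "normal", "late"))
)

# keep only the numeric columns, since KNN needs numeric input
toy.num <- toy[, sapply(toy, is.numeric), drop = FALSE]
names(toy.num)
```

Applying the same filter to both the train and test predictors keeps their columns aligned.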
I was taught a common rule of thumb for choosing K in a KNN model: take the square root of the number of observations in the training set.
## [1] 57
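The rule of thumb is a one-liner. In this sketch the training size of 3202 is an inference from the model summaries (3201 null degrees of freedom plus one), and `kk` is the name later passed to class::knn().

```r
# training-set size inferred from the glm summary: 3201 null df + 1 (an assumption)
n.train <- 3202

# rule of thumb: k is the rounded square root of the number of training rows
kk <- round(sqrt(n.train))
kk
```

This gives 57, matching the value printed above.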
# Knn modeling
knn.mod <- class::knn(train = trainer1.num, test = tester1.num,
cl = trainer1[,10], k = kk)
# confusion matrix
knn.ev1 <- confusionMatrix(knn.mod, reference = tester1.y,
positive = "True")
knn.ev1
## Confusion Matrix and Statistics
##
## Reference
## Prediction False True
## False 163 114
## True 224 297
##
## Accuracy : 0.5764
## 95% CI : (0.5413, 0.611)
## No Information Rate : 0.515
## P-Value [Acc > NIR] : 0.0002876
##
## Kappa : 0.145
##
## Mcnemar's Test P-Value : 0.000000003051
##
## Sensitivity : 0.7226
## Specificity : 0.4212
## Pos Pred Value : 0.5701
## Neg Pred Value : 0.5884
## Prevalence : 0.5150
## Detection Rate : 0.3722
## Detection Prevalence : 0.6529
## Balanced Accuracy : 0.5719
##
## 'Positive' Class : True
##
As we can see, the glm model is far better than knn (accuracy 90.0% vs. 57.6%). However, we'll try to improve both models to achieve higher accuracy (or other metrics).
Bane image by: chroneco
One way to improve the glm model is stepwise selection. I'll use the backward direction to drop the least useful predictors. The step() function starts from the model with every available variable and then drops them one at a time for as long as doing so lowers the AIC. Redundant predictors tend to be dropped along the way because they add little information, although step() does not test for multicollinearity directly. AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model.
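As a rough illustration of backward elimination, here is a sketch on simulated data. The columns `gold`, `kills`, `noise1`, and `noise2` are invented, and the call on the real data is only assumed to look similar (something like step(glm.modx, direction = "backward")).

```r
set.seed(42)
n <- 300
toy <- data.frame(
  gold   = rnorm(n),
  kills  = rnorm(n),
  noise1 = rnorm(n),   # unrelated to the outcome
  noise2 = rnorm(n)    # unrelated to the outcome
)
toy$win <- rbinom(n, 1, plogis(1.5 * toy$gold - 1.0 * toy$kills))

# start from the full model with every predictor
full <- glm(win ~ ., family = "binomial", data = toy)

# backward elimination: drop one term at a time while the AIC keeps decreasing
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)
c(full = AIC(full), reduced = AIC(reduced))
```

The reduced model can never have a higher AIC than the full one, because step() only removes a term when that lowers the AIC. The summary that follows is from the article's own stepwise model fitted on trainer1, not from this toy example.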
##
## Call:
## glm(formula = radiant_win ~ radiant_ward_sentry_count + radiant_gold +
## radiant_kills + radiant_deaths + dire_gold + dire_kills +
## dire_deaths + dire.bottle.time + dire.fly.courier.time +
## duration.c + r.carry + r.disabler + r.escape + r.nuker +
## r.pusher + r.support + d.disabler + d.durable + d.initiator +
## d.support, family = "binomial", data = trainer1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0618 -0.3464 0.0413 0.3452 3.7217
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.87515 0.83830 -1.044 0.296505
## radiant_ward_sentry_count 1.30272 0.82584 1.577 0.114695
## radiant_gold 5.52261 0.84924 6.503 0.000000000078719 ***
## radiant_kills -4.39578 0.60841 -7.225 0.000000000000501 ***
## radiant_deaths -17.60185 0.69013 -25.505 < 0.0000000000000002 ***
## dire_gold -4.48986 0.93544 -4.800 0.000001588779142 ***
## dire_kills 4.54758 0.66986 6.789 0.000000000011304 ***
## dire_deaths 18.25000 0.70511 25.883 < 0.0000000000000002 ***
## dire.bottle.time2 0.34384 0.19551 1.759 0.078628 .
## dire.bottle.time3 0.23272 0.19971 1.165 0.243893
## dire.bottle.time4 0.02020 0.19747 0.102 0.918505
## dire.bottle.time5 -0.38890 0.19213 -2.024 0.042950 *
## dire.fly.courier.time2 0.25997 0.23608 1.101 0.270816
## dire.fly.courier.time3 0.54348 0.32344 1.680 0.092894 .
## dire.fly.courier.time4 0.20935 0.20396 1.026 0.304697
## dire.fly.courier.time5 0.55188 0.18313 3.014 0.002582 **
## duration.cnormal -0.04879 0.16839 -0.290 0.772018
## duration.clate -0.36884 0.17128 -2.153 0.031282 *
## r.carry 0.25744 0.07534 3.417 0.000633 ***
## r.disabler -0.15092 0.07364 -2.049 0.040428 *
## r.escape -0.12639 0.06265 -2.017 0.043655 *
## r.nuker -0.10344 0.06915 -1.496 0.134698
## r.pusher 0.16918 0.07158 2.364 0.018095 *
## r.support 0.23258 0.07569 3.073 0.002120 **
## d.disabler 0.17478 0.07652 2.284 0.022363 *
## d.durable -0.19007 0.06640 -2.863 0.004202 **
## d.initiator -0.11190 0.07173 -1.560 0.118729
## d.support -0.20817 0.07309 -2.848 0.004396 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4436.0 on 3201 degrees of freedom
## Residual deviance: 1761.1 on 3174 degrees of freedom
## AIC: 1817.1
##
## Number of Fisher Scoring iterations: 6
From the summary above, we can conclude that gold, kills, and deaths for both teams are the most significant variables in our model (they have the lowest p-values). Among the hero roles, carry is the most significant, followed by support and durable. duration.clate has a significant negative coefficient, which suggests that longer games tend to go against the Radiant team.
Dota 2 players may find this summary relatable. From experience, we know that carry and support have very different jobs in the game and that both are crucial for winning. Tanky heroes (durable) and disablers are also important. The timing of the bottle and flying courier within the first 5 minutes is also significantly associated with the outcome. A team needs a mix of roles and attention to detailed mechanics such as the bottle and courier to win; experience says so, and the data backs it up.
Multicollinearity check
We use the Variance Inflation Factor (VIF) to check for multicollinearity among the variables in our model. A common rule of thumb is that a VIF greater than 10 may indicate high collinearity and is worth further inspection.
## GVIF Df GVIF^(1/(2*Df))
## radiant_ward_sentry_count 1.056419 1 1.027823
## radiant_gold 1.956312 1 1.398682
## radiant_kills 2.051588 1 1.432336
## radiant_deaths 2.922481 1 1.709526
## dire_gold 2.135187 1 1.461228
## dire_kills 2.236793 1 1.495591
## dire_deaths 3.038014 1 1.742990
## dire.bottle.time 1.181329 4 1.021048
## dire.fly.courier.time 1.137034 4 1.016182
## duration.c 1.502324 2 1.107110
## r.carry 1.197186 1 1.094160
## r.disabler 1.164305 1 1.079030
## r.escape 1.094758 1 1.046307
## r.nuker 1.096136 1 1.046965
## r.pusher 1.072751 1 1.035737
## r.support 1.236724 1 1.112081
## d.disabler 1.409389 1 1.187177
## d.durable 1.175789 1 1.084338
## d.initiator 1.393327 1 1.180393
## d.support 1.193396 1 1.092426
It looks like our model doesn't have a multicollinearity problem: every value is well below 10.
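For intuition about what the VIF measures, here is a base-R sketch for an ordinary linear setting, where VIF_j = 1 / (1 - R²_j) and R²_j comes from regressing predictor j on all the others. The data and the names `x1`, `x2`, `x3` are made up. The GVIF table above presumably comes from a dedicated function such as car::vif(), which is an assumption on my part, and for a glm its numbers will not match this formula exactly.

```r
set.seed(7)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so strongly collinear
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the other predictors
vif.manual <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
round(vif.manual, 2)
```

Here x1 and x3 should land far above the threshold of 10, while the independent x2 stays close to 1.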
# predict on the test data
predicted.df$pred.glm2 <- predict(glm.mod.2, type = "response", newdata = tester1)
# still use the 0.5 threshold
predicted.df$label.glm2 <- ifelse(predicted.df$pred.glm2 > 0.5, "True", "False")
# confusion matrix
glm.ev2 <- confusionMatrix(as.factor(predicted.df$label.glm2), reference = tester1$radiant_win,
                           positive = "True")
glm.ev2
## Confusion Matrix and Statistics
##
## Reference
## Prediction False True
## False 347 47
## True 40 364
##
## Accuracy : 0.891
## 95% CI : (0.8673, 0.9118)
## No Information Rate : 0.515
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.7819
##
## Mcnemar's Test P-Value : 0.5201
##
## Sensitivity : 0.8856
## Specificity : 0.8966
## Pos Pred Value : 0.9010
## Neg Pred Value : 0.8807
## Prevalence : 0.5150
## Detection Rate : 0.4561
## Detection Prevalence : 0.5063
## Balanced Accuracy : 0.8911
##
## 'Positive' Class : True
##
KNN is driving me crazy. I decided to convert all the factor columns into numeric ones with dummy_cols(), setting remove_first_dummy = TRUE to avoid multicollinearity.
# select only the factor variables to build their dummies
trainer1.x.fac <- select_if(trainer1.x, is.factor)
tester1.x.fac <- select_if(tester1.x, is.factor)
# build the dummy variables
dum.v.train <- dummy_cols(trainer1.x.fac, remove_first_dummy = TRUE)
# keep only the dummy columns (the first 10 columns are the original factors)
dum.v.train <- dum.v.train[,11:46]
dum.v.test <- dummy_cols(tester1.x.fac, remove_first_dummy = TRUE)
dum.v.test <- dum.v.test[,11:46]
# bind with the numeric data frames
trainer2.x <- cbind(trainer1.num, dum.v.train)
tester2.x <- cbind(tester1.num, dum.v.test)
# fit the new knn model
knn.mod.2 <- class::knn(train = trainer2.x, test = tester2.x,
                        cl = trainer1[,10], k = kk)
knn.ev2 <- confusionMatrix(knn.mod.2, reference = tester1.y,
                           positive = "True")
knn.ev2
## Confusion Matrix and Statistics
##
## Reference
## Prediction False True
## False 148 103
## True 239 308
##
## Accuracy : 0.5714
## 95% CI : (0.5363, 0.6061)
## No Information Rate : 0.515
## P-Value [Acc > NIR] : 0.0007939
##
## Kappa : 0.1332
##
## Mcnemar's Test P-Value : 0.0000000000002878
##
## Sensitivity : 0.7494
## Specificity : 0.3824
## Pos Pred Value : 0.5631
## Neg Pred Value : 0.5896
## Prevalence : 0.5150
## Detection Rate : 0.3860
## Detection Prevalence : 0.6855
## Balanced Accuracy : 0.5659
##
## 'Positive' Class : True
##
We see that adding more predictors doesn't help much. Next, we'll vary K to look for the best accuracy. I wrote a loop that trains KNN for every k from 40 to 70, builds the confusion matrix, and stores the accuracy. Thanks to this article for the idea.
# build the loop from k = 40 to k = 70 in steps of 1
k_values <- seq(40,70,1)
num_k <- length(k_values)
# make an empty table to store the accuracy rates
acc.k.df <- tibble(k = rep(0, num_k), acc = rep(0, num_k))
# evaluate knn for each value of k
for(i in 1:num_k){
  k <- k_values[i]
  # fit the knn model for the current k
  k.finder <- class::knn(train = trainer2.x, test = tester2.x,
                         cl = trainer1[,10], k = k)
  # build the confusion matrix (a model and a confusion matrix are built in every iteration,
  # which is computationally expensive, so don't try this with many more observations)
  k.conf <- confusionMatrix(k.finder, reference = tester1.y,
                            positive = "True")
  # store the k value in the table
  acc.k.df[i, 'k'] <- k
  # store the accuracy from the confusion matrix
  acc.k.df[i, 'acc'] <- k.conf$overall[[1]]
}
Let's draw a plot from the K loop to make the interpretation easier.
acc.k.df %>% ggplot(aes(x = k, y= acc)) +
  geom_point() + geom_line() + theme_bw() +
  scale_x_continuous(breaks = seq(40,70,2))
Extra note: Due to the randomness in the KNN algorithm (ties are broken at random), the plot you see may differ from the one I saw when I ran the model (and when I knit the Rmd). The results are different every time I run the models, so here's a picture of my plot from when I wrote this article. From here on, the analysis is based on the plot I uploaded below.
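One way to make such runs repeatable is to fix the random seed before each call. The sketch below uses the built-in iris data rather than the Dota data, and the seed value 123 is arbitrary.

```r
library(class)

# split iris into alternating rows: odd rows for training, even rows for testing
train.x <- iris[c(TRUE, FALSE), 1:4]
test.x  <- iris[c(FALSE, TRUE), 1:4]
train.y <- iris$Species[c(TRUE, FALSE)]

# same seed before each call, so random tie-breaking is resolved the same way
set.seed(123)
run1 <- knn(train = train.x, test = test.x, cl = train.y, k = 10)
set.seed(123)
run2 <- knn(train = train.x, test = test.x, cl = train.y, k = 10)

identical(run1, run2)
```

Inside a loop over k, the seed would need to be set before every knn() call for each accuracy value to be reproducible.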
From the plot, K = 60 has the highest accuracy, so let's refit our knn model with it.
knn.mod.3 <- class::knn(train = trainer2.x, test = tester2.x,
                        cl = trainer1[,10], k = 60)
# evaluate the k = 60 model (knn.mod.3, not knn.mod.2)
knn.ev3 <- confusionMatrix(knn.mod.3, reference = tester1.y,
                           positive = "True")
knn.ev3
## Confusion Matrix and Statistics
##
## Reference
## Prediction False True
## False 148 103
## True 239 308
##
## Accuracy : 0.5714
## 95% CI : (0.5363, 0.6061)
## No Information Rate : 0.515
## P-Value [Acc > NIR] : 0.0007939
##
## Kappa : 0.1332
##
## Mcnemar's Test P-Value : 0.0000000000002878
##
## Sensitivity : 0.7494
## Specificity : 0.3824
## Pos Pred Value : 0.5631
## Neg Pred Value : 0.5896
## Prevalence : 0.5150
## Detection Rate : 0.3860
## Detection Prevalence : 0.6855
## Balanced Accuracy : 0.5659
##
## 'Positive' Class : True
##
Oh no, it actually isn't, haha. The confusion matrix above is identical to the one for knn.ev2, so it gives no evidence that K = 60 improves anything; changing K doesn't seem to answer our problem. The KNN algorithm uses Euclidean distance to find the nearest neighbours. Once we convert a categorical variable into dummies, each one can only take the values 1 and 0, so it can contribute only 0 or 1 to the squared distance. The resulting distances are coarse and easily biased, and the two classes are not well separated. So we can say that KNN is not suitable for our case: our data has a lot of categorical variables, and KNN does not handle categorical variables well.
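A small sketch shows how coarse Euclidean distances become on 0/1 columns. The three rows below are made-up values, not matches from the dataset.

```r
# three matches described only by 0/1 dummy variables (invented values)
dummies <- rbind(
  a = c(1, 0, 0, 1, 0),
  b = c(1, 0, 1, 1, 0),   # differs from a in one column
  c = c(0, 1, 1, 0, 1)    # differs from a in all five columns
)

d <- as.matrix(dist(dummies, method = "euclidean"))
d

# each mismatching dummy adds exactly 1 to the squared distance
d["a", "b"]^2
d["a", "c"]^2
```

The squared distance is simply the count of mismatching dummies, so many pairs of matches end up at exactly the same distance from each other.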
Bristleback image by: chroneco
# combine all confusion matrix into one dataframe to make it easier to evaluate
eval.glm <- data.frame(t(as.matrix(glm.ev1, what = "classes")))
eval.glm <- cbind(eval.glm, data.frame(t(as.matrix(glm.ev1, what = "overall"))))
eval.knn <- data.frame(t(as.matrix(knn.ev1, what = "classes")))
eval.knn <- cbind(eval.knn, data.frame(t(as.matrix(knn.ev1, what = "overall"))))
eval.glm.2 <- data.frame(t(as.matrix(glm.ev2, what = "classes")))
eval.glm.2 <- cbind(eval.glm.2, data.frame(t(as.matrix(glm.ev2, what = "overall"))))
eval.knn.2 <- data.frame(t(as.matrix(knn.ev2, what = "classes")))
eval.knn.2 <- cbind(eval.knn.2, data.frame(t(as.matrix(knn.ev2, what = "overall"))))
mod.eval <- rbind(eval.glm,eval.knn,eval.glm.2,eval.knn.2)
mod.eval <- mod.eval %>% `row.names<-`(c("glm","knn","glm2","knn2"))
head(mod.eval[,c("Sensitivity","Specificity","Precision","Recall","F1","Accuracy")])
From the table above, we can see that glm is far better than knn, even after I fed every variable to KNN and tried different values of K. This is because KNN is not well suited to categorical variables, and about half of our variables are categorical. The glm without stepwise selection has the best metrics of all the models. The backward stepwise version doesn't improve any evaluation metric, but we can argue that it is the better-quality model because of its lower AIC (1817.1 vs. 1871.8). It is always possible that another classification model would achieve higher accuracy (or other metrics). We'll try that in the future.
The most common use case for predicting Dota 2 winners is betting, right? Setting aside the harm of gambling addiction, this makes precision the metric to watch: we need to limit the number of false positives so that our stake isn't wasted on predicted wins that don't happen. If we prioritize the model with the higher precision, in this case we'd use the logistic regression without backward stepwise selection (even though it is only slightly better).
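As a quick check of the precision comparison, the values can be recomputed by hand from the two glm confusion matrices printed earlier. The counts below are copied from those matrices; the helper name `precision` is my own.

```r
# counts from the two glm confusion matrices (positive class = "True"):
# TP = predicted True and actually True, FP = predicted True but actually False
glm1 <- c(TP = 368, FP = 37)   # glm without stepwise
glm2 <- c(TP = 364, FP = 40)   # glm with backward stepwise

# precision = TP / (TP + FP)
precision <- function(m) unname(m["TP"] / (m["TP"] + m["FP"]))

prec <- round(c(glm = precision(glm1), glm2 = precision(glm2)), 4)
prec
```

These are the same numbers caret reports as "Pos Pred Value" (0.9086 and 0.9010), so the gap between the two models is under one percentage point.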
Thank you!
Shadow Fiend image by: chroneco