You know what Dota is, right? DotA stands for Defense of the Ancients. Dota 2 is a multiplayer online battle arena (MOBA) video game developed and published by Valve. It is played in matches between two teams of five players (called Radiant and Dire), with each team occupying and defending its own base on the map. Each of the ten players independently controls a powerful character, known as a hero. A team wins by being the first to destroy the other team’s Ancient, a large structure located within that team’s base.

Background

Objective

This project is based on an old Kaggle competition. The training set consists of matches for which all of the in-game events (kills, item purchases, etc.) as well as the match outcomes are known. We are given only the first 5 minutes of each match and need to predict the likelihood of a Radiant victory. You can find all the datasets and the competition here. As you can see from the Kaggle page, there are only 2 notebooks and they don’t tell us much, so we’re going to build the prediction from scratch.

Libraries

You can load the packages into your workspace using the library() function.

library(dplyr)
library(stringr)
library(rjson)
library(rsample)
library(data.table)
library(dummies)
library(fastDummies)
library(tokenizers)
library(caret)

Let’s begin

Crystal Maiden image by: chroneco

Data Import

Kaggle provides 5 datasets: test, train, hero_names, item_ids, and a submission example. Since I’m not going to submit my prediction to the closed competition, I don’t need to load the submission example. The train dataset contains 409,794 observations and 103 variables, one of which is the target label, so we have 102 variables to predict which team wins. The test dataset contains 71,772 observations, and I don’t need it either because I’m not submitting my results.

train1 <- read.csv("train.csv")
hero.name <- fromJSON(file = "hero_names.json")
item.id <- fromJSON(file = "item_ids.json")
item.id <- as.data.frame(t(as.matrix(item.id)))
item.id <- item.id %>% mutate(item.label = rownames(item.id))

Data Wrangling / Pre-process / Data Preparation

Exploratory Data Analysis

glimpse(train1)

## Observations: 409,794
## Variables: 103
## $ match_id                    <int> 1956782143, 1956782359, 1956782465, 195...
## $ r1_hero                     <int> 106, 112, 46, 35, 68, 62, 8, 53, 7, 12,...
## $ r1_level                    <int> 3, 0, 2, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, ...
## $ r1_xp                       <int> 2060, 747, 1839, 1681, 445, 953, 1737, ...
## $ r1_gold                     <int> 1864, 1117, 1477, 1312, 613, 1120, 1439...
## $ r1_lh                       <int> 28, 0, 13, 14, 3, 2, 22, 7, 19, 27, 9, ...
## $ r1_kills                    <int> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ r1_deaths                   <int> 7, 8, 0, 12, 9, 7, 3, 8, 3, 4, 6, 3, 20...
## $ r1_items                    <int> 6, 13, 10, 8, 6, 11, 12, 4, 8, 10, 9, 8...
## $ r2_hero                     <int> 8, 50, 6, 75, 107, 13, 80, 11, 14, 21, ...
## $ r2_level                    <int> 2, 0, 1, 0, 0, 1, 1, 0, 1, 0, 2, 0, 0, ...
## $ r2_xp                       <int> 1621, 1106, 1121, 1099, 1454, 768, 1471...
## $ r2_gold                     <int> 1040, 1330, 949, 1394, 980, 714, 1050, ...
## $ r2_lh                       <int> 20, 2, 11, 11, 2, 5, 13, 10, 6, 1, 9, 1...
## $ r2_kills                    <int> 0, 3, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ r2_deaths                   <int> 3, 1, 2, 3, 9, 6, 4, 15, 9, 8, 6, 1, 12...
## $ r2_items                    <int> 7, 10, 10, 9, 7, 9, 9, 7, 2, 10, 6, 10,...
## $ r3_hero                     <int> 62, 1, 2, 43, 65, 11, 25, 55, 73, 11, 7...
## $ r3_level                    <int> 1, 0, 0, 0, 0, 1, 0, 0, 2, 0, 2, 0, 0, ...
## $ r3_xp                       <int> 642, 1344, 1705, 1210, 763, 1877, 1834,...
## $ r3_gold                     <int> 1297, 1707, 1911, 1269, 819, 1752, 1319...
## $ r3_lh                       <int> 1, 21, 20, 6, 10, 15, 17, 12, 13, 19, 1...
## $ r3_kills                    <int> 2, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ r3_deaths                   <int> 8, 3, 2, 1, 8, 3, 10, 9, 7, 5, 11, 1, 9...
## $ r3_items                    <int> 13, 7, 4, 9, 5, 8, 9, 9, 3, 9, 10, 6, 8...
## $ r4_hero                     <int> 95, 11, 45, 83, 81, 99, 7, 26, 59, 51, ...
## $ r4_level                    <int> 2, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, ...
## $ r4_xp                       <int> 818, 2217, 946, 1096, 1920, 1340, 433, ...
## $ r4_gold                     <int> 960, 1867, 984, 671, 1788, 1320, 746, 5...
## $ r4_lh                       <int> 14, 25, 3, 0, 29, 20, 0, 1, 16, 24, 1, ...
## $ r4_kills                    <int> 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...
## $ r4_deaths                   <int> 10, 3, 2, 8, 9, 7, 6, 15, 6, 9, 12, 5, ...
## $ r4_items                    <int> 8, 7, 10, 7, 9, 8, 15, 10, 7, 12, 7, 6,...
## $ r5_hero                     <int> 6, 69, 32, 97, 21, 18, 50, 21, 25, 50, ...
## $ r5_level                    <int> 2, 0, 1, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, ...
## $ r5_xp                       <int> 1678, 1810, 876, 994, 1875, 1994, 569, ...
## $ r5_gold                     <int> 1391, 904, 726, 678, 1256, 1778, 1048, ...
## $ r5_lh                       <int> 16, 8, 4, 3, 16, 30, 0, 13, 7, 0, 14, 1...
## $ r5_kills                    <int> 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
## $ r5_deaths                   <int> 9, 1, 3, 7, 6, 2, 7, 6, 5, 2, 8, 11, 7,...
## $ r5_items                    <int> 6, 5, 5, 6, 7, 9, 10, 8, 5, 12, 8, 10, ...
## $ d1_hero                     <int> 37, 25, 7, 50, 11, 28, 59, 28, 35, 73, ...
## $ d1_level                    <int> 2, 0, 1, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, ...
## $ d1_xp                       <int> 1686, 1671, 1115, 1027, 2200, 1181, 186...
## $ d1_gold                     <int> 1282, 1109, 1130, 952, 1677, 1280, 1613...
## $ d1_lh                       <int> 16, 15, 9, 2, 25, 14, 24, 10, 20, 19, 1...
## $ d1_kills                    <int> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 1, ...
## $ d1_deaths                   <int> 8, 5, 5, 4, 5, 10, 1, 12, 4, 5, 5, 0, 7...
## $ d1_items                    <int> 9, 8, 7, 12, 8, 9, 6, 8, 7, 8, 6, 6, 7,...
## $ d2_hero                     <int> 49, 30, 106, 11, 57, 90, 87, 20, 56, 39...
## $ d2_level                    <int> 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, ...
## $ d2_xp                       <int> 1141, 963, 893, 1831, 614, 924, 712, 11...
## $ d2_gold                     <int> 1078, 500, 1066, 1386, 500, 907, 887, 1...
## $ d2_lh                       <int> 14, 0, 9, 16, 0, 4, 1, 3, 7, 6, 9, 13, ...
## $ d2_kills                    <int> 0, 0, 0, 1, 0, 0, 1, 1, 1, 2, 0, 0, 2, ...
## $ d2_deaths                   <int> 5, 6, 6, 0, 6, 4, 2, 8, 10, 5, 4, 1, 4,...
## $ d2_items                    <int> 4, 7, 6, 7, 7, 8, 12, 9, 11, 9, 5, 9, 1...
## $ d3_hero                     <int> 111, 27, 39, 26, 28, 100, 38, 81, 74, 6...
## $ d3_level                    <int> 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 2, 0, 0, ...
## $ d3_xp                       <int> 963, 414, 1661, 1015, 1153, 1749, 366, ...
## $ d3_gold                     <int> 641, 562, 935, 1268, 971, 1421, 1224, 1...
## $ d3_lh                       <int> 1, 1, 10, 0, 10, 12, 0, 19, 10, 2, 13, ...
## $ d3_kills                    <int> 0, 0, 0, 2, 0, 0, 2, 1, 0, 0, 0, 0, 1, ...
## $ d3_deaths                   <int> 10, 6, 1, 1, 4, 9, 5, 7, 5, 10, 1, 10, ...
## $ d3_items                    <int> 10, 9, 5, 12, 9, 7, 7, 14, 8, 11, 6, 6,...
## $ d4_hero                     <int> 98, 106, 50, 67, 86, 53, 69, 8, 30, 100...
## $ d4_level                    <int> 3, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, ...
## $ d4_xp                       <int> 2071, 1237, 1111, 855, 487, 898, 1903, ...
## $ d4_gold                     <int> 1377, 1802, 690, 1032, 500, 1035, 1474,...
## $ d4_lh                       <int> 19, 31, 4, 11, 0, 13, 17, 9, 3, 6, 2, 1...
## $ d4_kills                    <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ d4_deaths                   <int> 11, 3, 9, 0, 12, 1, 4, 3, 8, 12, 7, 11,...
## $ d4_items                    <int> 6, 10, 10, 5, 9, 5, 9, 7, 4, 4, 10, 10,...
## $ d5_hero                     <int> 83, 2, 73, 71, 8, 21, 21, 57, 71, 87, 2...
## $ d5_level                    <int> 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, ...
## $ d5_xp                       <int> 293, 1487, 1002, 1536, 1122, 1728, 1768...
## $ d5_gold                     <int> 635, 1350, 1905, 1379, 1924, 1264, 1709...
## $ d5_lh                       <int> 3, 17, 11, 14, 34, 16, 16, 4, 8, 3, 9, ...
## $ d5_kills                    <int> 0, 0, 0, 1, 0, 0, 2, 2, 0, 0, 0, 0, 1, ...
## $ d5_deaths                   <int> 7, 12, 5, 1, 5, 5, 3, 1, 1, 8, 5, 0, 5,...
## $ d5_items                    <int> 7, 8, 5, 7, 10, 8, 9, 8, 4, 10, 3, 6, 9...
## $ first_blood_time            <dbl> 12, -14, 93, 4, 130, 149, 18, 97, 214, ...
## $ first_blood_team            <dbl> 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, ...
## $ first_blood_player1         <dbl> 2, 3, 3, 7, 1, 2, 7, 8, 6, 6, 5, 0, 6, ...
## $ first_blood_player2         <dbl> 7, 6, 6, 3, 8, 9, 2, 4, 2, 3, 3, 5, 4, ...
## $ radiant_bottle_time         <dbl> 109, 70, 161, -32, 190, 151, 238, 191, ...
## $ radiant_courier_time        <dbl> -64, -77, -69, -69, -84, -71, -85, -86,...
## $ radiant_flying_courier_time <dbl> 223, 184, 204, 202, 272, NA, 269, 224, ...
## $ radiant_tpscroll_count      <int> 1, 3, 3, 5, 0, 2, 6, 5, 2, 7, 2, 0, 4, ...
## $ radiant_boots_count         <int> 3, 3, 2, 3, 4, 5, 3, 2, 4, 3, 4, 4, 2, ...
## $ radiant_ward_observer_count <int> 2, 2, 3, 1, 2, 2, 2, 3, 1, 4, 1, 2, 1, ...
## $ radiant_ward_sentry_count   <int> 0, 1, 0, 0, 1, 1, 1, 1, 0, 2, 0, 0, 0, ...
## $ radiant_first_ward_time     <dbl> 61, 177, -25, -6, 5, -6, -21, -20, -5, ...
## $ dire_bottle_time            <dbl> 144, 135, 48, 125, 97, 154, 162, NA, NA...
## $ dire_courier_time           <dbl> -71, -81, -72, -87, -84, -68, -44, -75,...
## $ dire_flying_courier_time    <dbl> 184, 294, 219, 209, NA, 186, NA, 182, 2...
## $ dire_tpscroll_count         <int> 2, 7, 4, 3, 3, 3, 2, 2, 2, 2, 3, 1, 1, ...
## $ dire_boots_count            <int> 2, 1, 2, 5, 3, 2, 5, 3, 5, 4, 5, 2, 4, ...
## $ dire_ward_observer_count    <int> 2, 3, 2, 3, 2, 2, 3, 2, 1, 3, 1, 2, 2, ...
## $ dire_ward_sentry_count      <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, ...
## $ dire_first_ward_time        <dbl> 32, -24, -20, 35, -42, -8, 88, 71, NA, ...
## $ duration                    <int> 4001, 2628, 1721, 1965, 2522, 2936, 157...
## $ radiant_win                 <fct> True, True, True, False, False, False, ...

If you have no idea what those variables are about, here are their descriptions:
- match_id : unique id for every match
- note r1-r5 mean player 1-5 in team Radiant, and d1-d5 mean player 1-5 in team Dire
- r1_hero : player’s hero (mapping can be found in hero_names.json)
- r1_level : maximum hero level reached (by the first 5 minutes of the game)
- r1_xp : maximum experience gained
- r1_gold : amount of gold earned
- r1_lh : last hits, number of creeps killed
- r1_kills : number of players killed
- r1_deaths : the number of deaths for the hero
- r1_items : the number of items purchased
- note: If the “first blood” event did not have time to occur in the first 5 minutes, the column contains a missing value.
- first_blood_time : time for the first blood (first_blood = first death/kill in the game)
- first_blood_team : the team that committed the first blood (0 - Radiant, 1 - Dire)
- first_blood_player1 : index of player who got the kill
- first_blood_player2 : index of player who got killed
- note: Features for both teams (prefixes radiant_ and dire_ ):
- radiant_bottle_time : the time the team first purchased the item “bottle”
- radiant_courier_time : acquisition time of the “courier” item
- radiant_flying_courier_time : acquisition time of the “flying_courier” item
- radiant_tpscroll_count : the number of tpscroll items bought in the first 5 minutes
- radiant_boots_count : the number of “boots” for the team in the first 5 minutes
- radiant_ward_observer_count : the number of “ward_observer” items
- radiant_ward_sentry_count : the number of “ward_sentry” items
- radiant_first_ward_time : the time for the first placed ward for the team
- radiant_win : True if the Radiant team won, False otherwise (this is our target variable)
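
To make the naming scheme above concrete, here is a minimal sketch that generates the per-player column names from the team prefixes and the stat suffixes. The object name `player_cols` is introduced here only for illustration; it is not part of the original analysis.

```r
# Build the 80 per-player column names (r1_hero ... d5_items).
# `player_cols` is a hypothetical name used only in this sketch.
prefixes <- c(paste0("r", 1:5), paste0("d", 1:5))
stats <- c("hero", "level", "xp", "gold", "lh", "kills", "deaths", "items")

# For each player prefix, append every stat suffix (player-major order)
player_cols <- unlist(lapply(prefixes, function(p) paste(p, stats, sep = "_")))

length(player_cols)
head(player_cols, 3)
```

A vector like this can be handy for selecting columns by name instead of by position.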

There’s a lot of data wrangling to do. As you can see from the glimpse above, many variables seem redundant. We’ll reduce them to as few as possible without losing their meaning.

# I found 1 row where first_blood_player2 contains -1. It may be a typo, so I change it to 0
train1$first_blood_player2[train1$first_blood_player2 == -1] <- 0
# positions of match_id, the hero and item columns, and the first blood team/player columns
i <- c(1,2,9,10,17,18,25,26,33,34,41,42,49,50,57,58,65,66,73,74,81,83,84,85)
train1.edit <- train1
# convert those columns to factor
train1.edit[,i] <- lapply(train1.edit[,i], as.factor)
# I also found missing values in the target variable (only 8 rows). I remove those rows and drop the unused factor level
train1.edit <- train1.edit[!(train1.edit$radiant_win==""),] %>% droplevels()
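
Here is a small sketch of what the last step does, on made-up data. It assumes, as in the real data, that the missing labels are stored as empty strings rather than as NA.

```r
# Toy data standing in for the real training set (values are made up)
toy <- data.frame(
  match_id = 1:5,
  radiant_win = factor(c("True", "", "False", "True", ""))
)

# Keep only rows whose label is not the empty string,
# then drop the now-unused "" level from the factor
toy_clean <- droplevels(toy[toy$radiant_win != "", ])

levels(toy_clean$radiant_win)
nrow(toy_clean)
```

Without droplevels(), the empty level would stay in the factor and show up as a third class with zero rows.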

# check the proportion of the target variable to make sure it's roughly balanced
table(train1.edit$radiant_win) %>% prop.table()

## 
##     False      True 
## 0.4823835 0.5176165

It’s good to go.

The dataset is way too large for my potato laptop, so I sample it down to only 4,000 observations.

set.seed(1502)
train1.edit <- sample_n(train1.edit, 4000)
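
sample_n() draws rows uniformly, so the class ratio in the sample can drift a little from the full data. As an alternative, here is a rough sketch (not what I used above) that samples inside each class so the ratio is preserved. The toy data and the 10% fraction are assumptions for illustration.

```r
set.seed(1502)

# Toy labels with roughly the class balance reported above (made-up data)
toy <- data.frame(
  id = 1:1000,
  radiant_win = factor(rep(c("False", "True"), times = c(480, 520)))
)

# Sample 10% of the rows inside each class so the class ratio is preserved
idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$radiant_win), function(rows) {
  sample(rows, size = round(0.1 * length(rows)))
}))
toy_small <- toy[idx, ]

prop.table(table(toy_small$radiant_win))
```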

Feature Engineering

Summarize Redundant Variables

I’ll combine some numeric data per team

# r1-r5 stands for radiant player 1-5, and d1-d5 for dire player 1-5
train1.edit <- train1.edit %>%
  mutate(
    radiant_gold = as.integer(r1_gold + r2_gold + r3_gold + r4_gold + r5_gold),
    radiant_xp = as.integer(r1_xp + r2_xp + r3_xp + r4_xp + r5_xp),
    radiant_level = as.integer(r1_level + r2_level + r3_level + r4_level + r5_level),
    radiant_lh = as.integer(r1_lh + r2_lh + r3_lh + r4_lh + r5_lh),
    radiant_kills = as.integer(r1_kills + r2_kills + r3_kills + r4_kills + r5_kills),
    radiant_deaths = as.integer(r1_deaths + r2_deaths + r3_deaths + r4_deaths + r5_deaths),
    dire_gold = as.integer(d1_gold + d2_gold + d3_gold + d4_gold + d5_gold),
    dire_xp = as.integer(d1_xp + d2_xp + d3_xp + d4_xp + d5_xp),
    dire_level = as.integer(d1_level + d2_level + d3_level + d4_level + d5_level),
    dire_lh = as.integer(d1_lh + d2_lh + d3_lh + d4_lh + d5_lh),
    dire_kills = as.integer(d1_kills + d2_kills + d3_kills + d4_kills + d5_kills),
    dire_deaths = as.integer(d1_deaths + d2_deaths + d3_deaths + d4_deaths + d5_deaths)
  )
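
The twelve sums above all follow the same pattern, so they could also be written with a small helper. This is only a sketch on made-up data; `team_total` is a hypothetical function name, not something from a package.

```r
# Toy frame with the gold columns of two matches (made-up values)
toy <- data.frame(
  r1_gold = c(100L, 200L), r2_gold = c(10L, 20L), r3_gold = c(1L, 2L),
  r4_gold = c(5L, 5L), r5_gold = c(4L, 3L)
)

# Hypothetical helper: sum one stat over the five players of a team
team_total <- function(df, team, stat) {
  cols <- paste0(team, 1:5, "_", stat)
  as.integer(rowSums(df[, cols]))
}

toy$radiant_gold <- team_total(toy, "r", "gold")
toy$radiant_gold
```

On the real data the same call would be repeated for "r" and "d" and for each of gold, xp, level, lh, kills and deaths.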

train1x <- train1.edit[,1:81] %>%
  select_if(Negate(is.integer))

train1.new <- cbind(train1x, train1.edit[,82:115])
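# +90 means I add 90 seconds to the time variables to avoid negative values in the data.
# In Dota 2, the clock starts at -1:30 (90 seconds of preparation before 0:00).
# A negative value in the raw data means the event happened before the game started at 0:00.
# Note: if the columns are ordered as in the glimpse above, position 42 is duration, so it is shifted by 90 seconds too.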
train1.new[,c(22,26,27,28,33,34,35,36,41,42)] <- train1.new[,c(22,26,27,28,33,34,35,36,41,42)] + 90
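
Position-based indexing is easy to get wrong when columns move around. A sketch of the same shift done by column name, on made-up data, could look like this. The pattern "_time$" is my assumption about which columns count as event times; it deliberately leaves duration untouched.

```r
# Toy frame with two event-time columns, a count column and the match duration
toy <- data.frame(
  first_blood_time = c(12, -14, NA),
  radiant_courier_time = c(-64, -77, -69),
  radiant_boots_count = c(3L, 3L, 2L),
  duration = c(4001L, 2628L, 1721L)
)

# Pick event-time columns by name; `duration` does not end in "_time",
# so it is not shifted
time_cols <- grep("_time$", names(toy), value = TRUE)
toy[time_cols] <- lapply(toy[time_cols], function(x) x + 90)

toy
```

NA values stay NA after the shift, which matters for the next step.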

Convert Time Attributes to Categories

NA in the time variables is kind of tricky. If you fill NA with 0, the model will assume the event occurred at second 0, and we don’t want that. Instead, I convert every time variable (in seconds) to a categorical variable with four levels. An additional level is added later for NA values, indicating that the event did not happen in the first 5 minutes.

labels.time <- c("1","2","3","4")

train1.new <- train1.new %>% mutate(
  first.blood.time = cut(
    train1.new$first_blood_time,
    breaks = c(-Inf,103,171,232,Inf),
    labels = labels.time),
  radiant.bottle.time = cut(
    train1.new$radiant_bottle_time,
    breaks = c(-Inf, 166, 219, 257, Inf),
    labels = labels.time),
  radiant.courier.time = cut(
    train1.new$radiant_courier_time,
    breaks = c(-Inf, 7, 16, 27, Inf),
    labels = labels.time),
  radiant.fly.courier.time = cut(
    train1.new$radiant_flying_courier_time,
    breaks = c(-Inf, 278, 301, 311, Inf),
    labels = labels.time),
  radiant.first.ward.time = cut(
    train1.new$radiant_first_ward_time,
    breaks = c(-Inf, 68, 90, 150, Inf),
    labels = labels.time),
  dire.bottle.time = cut(
    train1.new$dire_bottle_time,
    breaks = c(-Inf, 166, 219, 257, Inf),
    labels = labels.time),
  dire.courier.time = cut(
    train1.new$dire_courier_time,
    breaks = c(-Inf, 7, 16, 27, Inf),
    labels = labels.time),
  dire.fly.courier.time = cut(
    train1.new$dire_flying_courier_time,
    breaks = c(-Inf, 278, 301, 311, Inf),
    labels = labels.time),
  dire.first.ward.time = cut(
    train1.new$dire_first_ward_time,
    breaks = c(-Inf, 68, 90, 150, Inf),
    labels = labels.time)
  )

# In Dota terms, game duration is usually split into 3 groups: fast game, normal game, and late game.
# I convert the game duration into 3 categories: fast for up to 42 minutes, normal for over 42 up to 49 minutes, and late for more than 49 minutes.
train1.new <- train1.new %>% mutate(
  duration.c = cut(
    train1.new$duration/60,
    breaks = c(-Inf, 42, 49, Inf),
    labels = c("fast","normal","late")
  )
)
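
The break values above are fixed numbers. One way such cut points could be derived is from the quartiles of each variable; that is an assumption for this sketch, which uses made-up times and also shows how cut() treats NA.

```r
# Made-up event times in seconds; NA means the event did not happen
# in the first 5 minutes
times <- c(10, 50, 120, 200, 260, NA, 90, 300)

# Assumption: the inner breaks are the quartiles of the observed values
inner <- quantile(times, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
binned <- cut(times, breaks = c(-Inf, inner, Inf), labels = c("1", "2", "3", "4"))

# cut() uses right-closed intervals by default and keeps NA as NA
table(binned, useNA = "ifany")
```

Because NA survives the binning, it can be given its own level afterwards instead of being mixed into one of the four time bins.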

Let’s check whether the numeric values were converted to factors correctly.

boxplot(train1.new$first_blood_time~train1.new$first.blood.time, col = 3:6)

boxplot(train1.new$dire_bottle_time~train1.new$dire.bottle.time, col =3:6)

boxplot(train1.new$radiant_courier_time~train1.new$radiant.courier.time, col=3:6)

boxplot(train1.new$dire_first_ward_time~train1.new$dire.first.ward.time, col=3:6)

boxplot(train1.new$duration/60~train1.new$duration.c, col =3:5)

# drop the original time columns that are no longer needed
train1.x <- train1.new %>% select(-c(22,26,27,28,33,34,35,36,41,42))

Fill NA with a New Categorical Level

# Fill NA in the time variables with level 5. Level 5 means the team did not do that action.
# For example, if first.blood.time is 5,
# the match had no first blood in the first 5 minutes.

train1.x[46:54] <- lapply(train1.x[46:54], function(x){
  x <- factor(x, exclude = NULL)
  levels(x)[is.na(levels(x))] <- "5"
  return(x)
})

## for NA in first_blood_team, I fill in "3", meaning the match has no first blood in the first 5 minutes
levels(train1.x$first_blood_team) <-  c(levels(train1.x$first_blood_team), "3")
train1.x$first_blood_team[which(is.na(train1.x$first_blood_team))] <-  "3"


## and for NA in the first_blood_player columns, I fill in "10", meaning the match has no first blood in the first 5 minutes
train1.x[23:24] <- lapply(train1.x[23:24], function(x){
  x <- factor(x, exclude = NULL)
  levels(x)[is.na(levels(x))] <- "10"
  return(x)
})
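
The same idea on a tiny made-up factor, as a sketch. It uses base R’s addNA(), which does the same job as factor(x, exclude = NULL) for a vector that is already a factor.

```r
# A binned time variable where NA means "the event did not happen"
x <- factor(c("1", "3", NA, "2", NA), levels = c("1", "2", "3", "4"))

# addNA() turns NA into a real level; renaming that level gives the extra category
x <- addNA(x)
levels(x)[is.na(levels(x))] <- "5"

table(x)
```

After this step the factor has five levels and no missing values, so the model can treat "did not happen" as just another category.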

Normalization

The radiant/dire TP scroll, boots, and ward counts have a different range from radiant/dire gold, XP, kills, etc., so we need to normalize them to rescale everything into the same range (0 to 1).

# Normalization function 
normalize <- function(x){
  return ( 
    (x - min(x))/(max(x) - min(x)) 
  )}

# select integer class data
int <- sapply(train1.x, class)=="integer"
# Perform normalization
train1.x[,int] <- lapply(train1.x[,int], normalize)
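
A quick sketch of the min-max formula on made-up values. The variant below, `normalize_safe`, is a hypothetical name; it adds two guards that the function above does not have (ignoring NA, and handling a constant column), which may or may not be needed for this dataset.

```r
# Same min-max formula as above, with two hedged additions:
# NA values are ignored when finding the range, and a constant column
# maps to 0 instead of dividing by zero
normalize_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

normalize_safe(c(500L, 1250L, 2000L))  # 0.0 0.5 1.0
normalize_safe(c(3L, 3L, 3L))          # 0 0 0
```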

Convert Heroes Used to Role Categories

Now comes the hardest part (for me). I spent a lot of time figuring out how to reduce the number of hero levels. There are 117 heroes in Dota, and this dataset uses 115 of them. If we put a factor with 115 levels into the model, I’m fairly sure it will break. Fortunately, the competition provides hero details as JSON. I convert the heroes used by each team in each match to their roles.

# select used hero variable in main dataframe
hero.use <- train1.x[,grepl("_hero", names(train1.x))]
# convert list (from json) to dataframe
hero.df2 <- rbindlist(hero.name)
hero.df2$roles <- as.factor(hero.df2$roles)
# In Dota 2, hero roles are split into 9 categories: carry, disabler, durable, escape, initiator, jungler, nuker, pusher, and support.
# One hero can have many roles; for example, `Anti Mage` (hero id `1`) is a Carry, Escape, and Nuker.
# Group the heroes by id, then combine all of a hero's roles into one string.
hero.df2x <- hero.df2 %>% group_by(id) %>%
  summarise(roles.c = paste(roles, collapse = " "))
hero.df2x$id <- as.factor(hero.df2x$id)
# change all the id in hero.use to its matching roles
use.cate <- apply(hero.use, c(1,2), function(x) hero.df2x[hero.df2x$id == x, 2])
use.cate <- as.data.frame(matrix(unlist(use.cate), nrow = nrow(hero.use)), stringsAsFactors = FALSE)
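
The id-to-roles replacement can also be sketched with match(), which avoids the cell-by-cell apply() and the reshaping afterwards. Everything below is made-up toy data; `roles_by_id`, `picks` and `lookup` are hypothetical names standing in for hero.df2x, hero.use and the lookup step.

```r
# Toy lookup table: hero id -> space-separated roles (made-up subset)
roles_by_id <- data.frame(
  id = c(1, 2, 5),
  roles.c = c("Carry Escape Nuker", "Initiator Durable", "Support Disabler"),
  stringsAsFactors = FALSE
)

# Toy hero picks for two matches (two player slots only, for brevity)
picks <- data.frame(r1_hero = factor(c(1, 5)), d1_hero = factor(c(2, 1)))

# match() finds the lookup row of each hero id, one whole column at a time
lookup <- function(col) {
  roles_by_id$roles.c[match(as.character(col), as.character(roles_by_id$id))]
}
pick_roles <- as.data.frame(lapply(picks, lookup), stringsAsFactors = FALSE)

pick_roles
```

The result keeps the column names of the input, so there is no need to know the number of rows in advance.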

Then I create new columns for the hero roles used by each team. For example, r.carry is the total number of heroes on the Radiant team that have the carry role (keep in mind that one hero can have many roles).

# hero roles for radiant team per match
use.cate$r.carry <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "carry"))
use.cate$r.disabler <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "disabler"))
use.cate$r.durable <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "durable"))
use.cate$r.escape <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "escape"))
use.cate$r.initiator <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "initiator"))
use.cate$r.jungler <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "jungler"))
use.cate$r.nuker <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "nuker"))
use.cate$r.pusher <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "pusher"))
use.cate$r.support <- apply(use.cate[,1:5],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "support"))

# hero roles for dire team per match
use.cate$d.carry <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "carry"))
use.cate$d.disabler <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "disabler"))
use.cate$d.durable <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "durable"))
use.cate$d.escape <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "escape"))
use.cate$d.initiator <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "initiator"))
use.cate$d.jungler <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "jungler"))
use.cate$d.nuker <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "nuker"))
use.cate$d.pusher <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "pusher"))
use.cate$d.support <- apply(use.cate[,6:10],1,function(x)sum(unlist(tokenize_words(x))
                                       %in% "support"))
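
The eighteen calls above differ only in the team columns and the role name, so the counting could be wrapped in one helper and looped over the roles. This is a sketch on made-up data using base R’s strsplit() instead of tokenize_words(); `count_role` is a hypothetical name.

```r
# Toy role strings for one team in two matches (made-up values)
team_roles <- data.frame(
  V1 = c("Carry Escape Nuker", "Support Disabler"),
  V2 = c("Carry Nuker", "Carry Pusher"),
  stringsAsFactors = FALSE
)

roles <- c("carry", "disabler", "durable", "escape", "initiator",
           "jungler", "nuker", "pusher", "support")

# Count, per match (row), how many role words equal the given role
count_role <- function(df, role) {
  apply(df, 1, function(x) sum(tolower(unlist(strsplit(x, " "))) == role))
}

# One column per role, named r.<role> as in the text above
counts <- sapply(roles, function(r) count_role(team_roles, r))
colnames(counts) <- paste0("r.", roles)
counts
```

For the Dire team the same loop would run over columns 6 to 10 with a "d." prefix.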

Let’s take a quick look at our hero data.

glimpse(use.cate)

## Observations: 4,000
## Variables: 28
## $ V1          <chr> "Carry Escape Nuker Disabler Initiator", "Carry Initiat...
## $ V2          <chr> "Carry Nuker", "Carry Initiator Durable Disabler Nuker"...
## $ V3          <chr> "Nuker Durable Escape", "Support Disabler Nuker", "Carr...
## $ V4          <chr> "Support Disabler Nuker", "Carry Nuker Pusher", "Carry ...
## $ V5          <chr> "Support Disabler Nuker Durable", "Disabler Initiator D...
## $ V6          <chr> "Carry Support Disabler Escape Nuker", "Support Disable...
## $ V7          <chr> "Carry Durable Initiator", "Initiator Jungler Escape Di...
## $ V8          <chr> "Initiator Disabler Nuker", "Carry Disabler Initiator D...
## $ V9          <chr> "Carry Initiator Disabler Durable Escape", "Carry Escap...
## $ V10         <chr> "Support Carry Durable", "Carry Nuker Pusher Initiator ...
## $ r.carry     <int> 2, 3, 4, 3, 4, 2, 2, 3, 2, 3, 3, 3, 2, 4, 2, 4, 2, 2, 3...
## $ r.disabler  <int> 3, 4, 3, 2, 4, 4, 4, 4, 5, 5, 3, 5, 5, 5, 3, 2, 5, 3, 4...
## $ r.durable   <int> 2, 3, 2, 0, 2, 3, 1, 2, 2, 2, 1, 2, 2, 4, 0, 3, 2, 1, 1...
## $ r.escape    <int> 2, 1, 2, 2, 3, 3, 3, 1, 1, 1, 4, 2, 0, 1, 3, 2, 1, 1, 3...
## $ r.initiator <int> 1, 3, 1, 0, 2, 2, 2, 2, 3, 1, 2, 2, 4, 4, 0, 3, 2, 3, 2...
## $ r.jungler   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 1, 0...
## $ r.nuker     <int> 5, 4, 3, 3, 5, 3, 5, 5, 5, 5, 3, 2, 4, 3, 5, 3, 3, 3, 4...
## $ r.pusher    <int> 0, 1, 3, 2, 2, 0, 2, 1, 1, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0...
## $ r.support   <int> 2, 1, 1, 2, 0, 3, 2, 0, 2, 4, 2, 2, 3, 1, 1, 0, 2, 2, 3...
## $ d.carry     <int> 4, 3, 1, 2, 2, 4, 2, 4, 3, 3, 1, 3, 4, 3, 4, 5, 5, 2, 3...
## $ d.disabler  <int> 3, 5, 4, 4, 4, 3, 3, 4, 4, 2, 3, 2, 4, 4, 5, 5, 2, 4, 1...
## $ d.durable   <int> 3, 2, 2, 3, 1, 1, 2, 2, 2, 2, 3, 2, 3, 2, 1, 2, 2, 2, 1...
## $ d.escape    <int> 2, 2, 0, 4, 3, 2, 0, 3, 3, 0, 1, 2, 2, 1, 3, 2, 2, 1, 4...
## $ d.initiator <int> 3, 3, 2, 4, 3, 2, 2, 2, 1, 2, 2, 1, 2, 2, 0, 3, 1, 3, 1...
## $ d.jungler   <int> 0, 1, 1, 0, 1, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0...
## $ d.nuker     <int> 2, 3, 4, 4, 5, 4, 4, 2, 4, 3, 4, 4, 3, 4, 4, 4, 3, 3, 3...
## $ d.pusher    <int> 0, 1, 2, 0, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0, 0, 2, 3, 1, 1...
## $ d.support   <int> 2, 1, 2, 0, 3, 2, 2, 2, 1, 1, 1, 2, 1, 2, 3, 2, 0, 3, 1...

# bind to main dataframe and drop unnecessary variables
use.cate <- use.cate %>% select(11:28)
train1.x <- cbind(train1.x,use.cate)
train1.x <- train1.x %>% select(-c(2,4,6,8,10,12,14,16,18,20))

Note: Items are one of the most useful components in Dota, but in this case I’m going to exclude them from modelling because each item column contains so many levels, and I’m afraid that would hurt the model. Until I can solve that problem, I treat items as if they were not part of the game.

# delete item variable from main dataframe
train1.clean <- train1.x %>% select(-c(2:11))

Finally, we can re-check our preprocessed data before modelling.

#check new clean data train
glimpse(train1.clean)

## Observations: 4,000
## Variables: 53
## $ match_id                    <fct> 2000270094, 1964538474, 1962513300, 198...
## $ first_blood_team            <fct> 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 3, ...
## $ first_blood_player1         <fct> 4, 2, 7, 1, 0, 5, 0, 2, 5, 1, 2, 1, 10,...
## $ first_blood_player2         <fct> 7, 6, 0, 6, 7, 4, 6, 8, 2, 5, 7, 5, 10,...
## $ radiant_tpscroll_count      <dbl> 0.23076923, 0.15384615, 0.15384615, 0.3...
## $ radiant_boots_count         <dbl> 0.7142857, 0.4285714, 0.1428571, 0.7142...
## $ radiant_ward_observer_count <dbl> 0.1666667, 0.3333333, 0.3333333, 0.3333...
## $ radiant_ward_sentry_count   <dbl> 0.125, 0.125, 0.000, 0.125, 0.000, 0.12...
## $ dire_tpscroll_count         <dbl> 0.41666667, 0.25000000, 0.16666667, 0.3...
## $ dire_boots_count            <dbl> 0.2857143, 0.4285714, 0.5714286, 0.5714...
## $ dire_ward_observer_count    <dbl> 0.5000000, 0.0000000, 0.1666667, 0.3333...
## $ dire_ward_sentry_count      <dbl> 0.00000000, 0.00000000, 0.00000000, 0.0...
## $ radiant_win                 <fct> False, False, False, True, True, True, ...
## $ radiant_gold                <dbl> 0.6481314, 0.6357171, 0.3212207, 0.6487...
## $ radiant_xp                  <dbl> 0.5800796, 0.7555334, 0.5527232, 0.7387...
## $ radiant_level               <dbl> 0.00000000, 0.00000000, 0.26666667, 0.6...
## $ radiant_lh                  <dbl> 0.4141414, 0.5858586, 0.1818182, 0.6161...
## $ radiant_kills               <dbl> 0.36363636, 0.18181818, 0.00000000, 0.1...
## $ radiant_deaths              <dbl> 0.3846154, 0.4285714, 0.4835165, 0.7142...
## $ dire_gold                   <dbl> 0.4851485, 0.4633663, 0.4905941, 0.5230...
## $ dire_xp                     <dbl> 0.4819911, 0.6572933, 0.6180448, 0.6528...
## $ dire_level                  <dbl> 0.0000000, 0.0000000, 0.3333333, 0.4666...
## $ dire_lh                     <dbl> 0.4411765, 0.4705882, 0.3529412, 0.3725...
## $ dire_kills                  <dbl> 0.00000000, 0.00000000, 0.08333333, 0.0...
## $ dire_deaths                 <dbl> 0.5402299, 0.3563218, 0.4482759, 0.5517...
## $ first.blood.time            <fct> 1, 1, 4, 1, 2, 3, 3, 2, 3, 2, 3, 4, 5, ...
## $ radiant.bottle.time         <fct> 2, 1, 2, 2, 1, 1, 5, 5, 1, 3, 3, 1, 1, ...
## $ radiant.courier.time        <fct> 1, 3, 2, 3, 3, 3, 1, 4, 2, 1, 1, 4, 1, ...
## $ radiant.fly.courier.time    <fct> 4, 1, 4, 2, 5, 5, 3, 5, 5, 4, 2, 4, 1, ...
## $ radiant.first.ward.time     <fct> 3, 2, 1, 3, 1, 2, 1, 5, 2, 1, 2, 5, 2, ...
## $ dire.bottle.time            <fct> 5, 4, 3, 4, 4, 5, 3, 2, 5, 2, 1, 5, 2, ...
## $ dire.courier.time           <fct> 1, 4, 1, 4, 4, 3, 1, 3, 1, 1, 1, 3, 1, ...
## $ dire.fly.courier.time       <fct> 4, 5, 4, 5, 5, 1, 4, 5, 3, 5, 4, 5, 1, ...
## $ dire.first.ward.time        <fct> 4, 5, 1, 3, 3, 3, 1, 2, 1, 1, 1, 3, 2, ...
## $ duration.c                  <fct> late, late, late, late, normal, fast, f...
## $ r.carry                     <int> 2, 3, 4, 3, 4, 2, 2, 3, 2, 3, 3, 3, 2, ...
## $ r.disabler                  <int> 3, 4, 3, 2, 4, 4, 4, 4, 5, 5, 3, 5, 5, ...
## $ r.durable                   <int> 2, 3, 2, 0, 2, 3, 1, 2, 2, 2, 1, 2, 2, ...
## $ r.escape                    <int> 2, 1, 2, 2, 3, 3, 3, 1, 1, 1, 4, 2, 0, ...
## $ r.initiator                 <int> 1, 3, 1, 0, 2, 2, 2, 2, 3, 1, 2, 2, 4, ...
## $ r.jungler                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ r.nuker                     <int> 5, 4, 3, 3, 5, 3, 5, 5, 5, 5, 3, 2, 4, ...
## $ r.pusher                    <int> 0, 1, 3, 2, 2, 0, 2, 1, 1, 0, 0, 2, 0, ...
## $ r.support                   <int> 2, 1, 1, 2, 0, 3, 2, 0, 2, 4, 2, 2, 3, ...
## $ d.carry                     <int> 4, 3, 1, 2, 2, 4, 2, 4, 3, 3, 1, 3, 4, ...
## $ d.disabler                  <int> 3, 5, 4, 4, 4, 3, 3, 4, 4, 2, 3, 2, 4, ...
## $ d.durable                   <int> 3, 2, 2, 3, 1, 1, 2, 2, 2, 2, 3, 2, 3, ...
## $ d.escape                    <int> 2, 2, 0, 4, 3, 2, 0, 3, 3, 0, 1, 2, 2, ...
## $ d.initiator                 <int> 3, 3, 2, 4, 3, 2, 2, 2, 1, 2, 2, 1, 2, ...
## $ d.jungler                   <int> 0, 1, 1, 0, 1, 1, 0, 2, 0, 0, 0, 1, 0, ...
## $ d.nuker                     <int> 2, 3, 4, 4, 5, 4, 4, 2, 4, 3, 4, 4, 3, ...
## $ d.pusher                    <int> 0, 1, 2, 0, 1, 2, 0, 1, 1, 0, 1, 1, 0, ...
## $ d.support                   <int> 2, 1, 2, 0, 3, 2, 2, 2, 1, 1, 1, 2, 1, ...

colSums(is.na(train1.clean))

##                    match_id            first_blood_team 
##                           0                           0 
##         first_blood_player1         first_blood_player2 
##                           0                           0 
##      radiant_tpscroll_count         radiant_boots_count 
##                           0                           0 
## radiant_ward_observer_count   radiant_ward_sentry_count 
##                           0                           0 
##         dire_tpscroll_count            dire_boots_count 
##                           0                           0 
##    dire_ward_observer_count      dire_ward_sentry_count 
##                           0                           0 
##                 radiant_win                radiant_gold 
##                           0                           0 
##                  radiant_xp               radiant_level 
##                           0                           0 
##                  radiant_lh               radiant_kills 
##                           0                           0 
##              radiant_deaths                   dire_gold 
##                           0                           0 
##                     dire_xp                  dire_level 
##                           0                           0 
##                     dire_lh                  dire_kills 
##                           0                           0 
##                 dire_deaths            first.blood.time 
##                           0                           0 
##         radiant.bottle.time        radiant.courier.time 
##                           0                           0 
##    radiant.fly.courier.time     radiant.first.ward.time 
##                           0                           0 
##            dire.bottle.time           dire.courier.time 
##                           0                           0 
##       dire.fly.courier.time        dire.first.ward.time 
##                           0                           0 
##                  duration.c                     r.carry 
##                           0                           0 
##                  r.disabler                   r.durable 
##                           0                           0 
##                    r.escape                 r.initiator 
##                           0                           0 
##                   r.jungler                     r.nuker 
##                           0                           0 
##                    r.pusher                   r.support 
##                           0                           0 
##                     d.carry                  d.disabler 
##                           0                           0 
##                   d.durable                    d.escape 
##                           0                           0 
##                 d.initiator                   d.jungler 
##                           0                           0 
##                     d.nuker                    d.pusher 
##                           0                           0 
##                   d.support 
##                           0

No column has missing values, so I think it's clean enough. Let's start modeling.

Modeling

Rikimaru image by: chroneco

Splitting

# drop match_id (first column): it is an identifier, not a predictor
train1.clean <- train1.clean[,-1]
# split data to train and test for model evaluation
splitted <- initial_split(train1.clean, prop = 0.8, strata = "radiant_win")
trainer1 <- training(splitted)
tester1 <- testing(splitted)

# let's re-check that the target classes are still balanced in the training split
table(trainer1$radiant_win) %>% prop.table()

## 
##     False      True 
## 0.4850094 0.5149906
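
The split above is random, so the exact proportions change from run to run. Below is a minimal sketch, on made-up toy data rather than the real train1.clean, of fixing the seed before initial_split() and comparing the class shares in both partitions. The toy data frame, the toy.* names and the seed value are assumptions for illustration only.

```r
library(rsample)

# Hypothetical toy data standing in for train1.clean
toy <- data.frame(
  radiant_win  = factor(rep(c("False", "True"), times = c(485, 515))),
  radiant_gold = seq_len(1000)
)

set.seed(2020)  # arbitrary seed (an assumption); any fixed value makes the split repeatable
toy.split <- initial_split(toy, prop = 0.8, strata = "radiant_win")
toy.train <- training(toy.split)
toy.test  <- testing(toy.split)

# with stratification the class shares should be close in both partitions
p.train <- prop.table(table(toy.train$radiant_win))
p.test  <- prop.table(table(toy.test$radiant_win))
rbind(train = p.train, test = p.test)
```

Checking the test partition as well as the training one is a cheap way to confirm that strata did its job.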

Logistic Regression

glm.mod <- glm(radiant_win~., family = "binomial", data=trainer1)

summary(glm.mod)

## 
## Call:
## glm(formula = radiant_win ~ ., family = "binomial", data = trainer1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0755  -0.3244   0.0365   0.3262   3.5204  
## 
## Coefficients: (5 not defined because of singularities)
##                               Estimate Std. Error z value             Pr(>|z|)
## (Intercept)                   0.718816   1.471502   0.488             0.625202
## first_blood_team1            -0.046612   0.423357  -0.110             0.912330
## first_blood_team3            -0.442272   0.428499  -1.032             0.302005
## first_blood_player11          0.149778   0.292705   0.512             0.608858
## first_blood_player12         -0.004468   0.310510  -0.014             0.988519
## first_blood_player13          0.203825   0.301222   0.677             0.498622
## first_blood_player14          0.122737   0.294077   0.417             0.676412
## first_blood_player15         -0.208040   0.280786  -0.741             0.458743
## first_blood_player16         -0.332541   0.297536  -1.118             0.263717
## first_blood_player17          0.122867   0.277578   0.443             0.658027
## first_blood_player18          0.466778   0.279362   1.671             0.094747
## first_blood_player19                NA         NA      NA                   NA
## first_blood_player110               NA         NA      NA                   NA
## first_blood_player21          0.147292   0.303031   0.486             0.626922
## first_blood_player22          0.270640   0.293668   0.922             0.356744
## first_blood_player23          0.233564   0.296682   0.787             0.431133
## first_blood_player24          0.364407   0.299440   1.217             0.223619
## first_blood_player25         -0.205978   0.307756  -0.669             0.503311
## first_blood_player26         -0.350375   0.307651  -1.139             0.254756
## first_blood_player27         -0.181992   0.297735  -0.611             0.541032
## first_blood_player28         -0.176838   0.313295  -0.564             0.572451
## first_blood_player29                NA         NA      NA                   NA
## first_blood_player210               NA         NA      NA                   NA
## radiant_tpscroll_count       -0.518914   0.608382  -0.853             0.393691
## radiant_boots_count          -0.282910   0.429236  -0.659             0.509831
## radiant_ward_observer_count  -0.177913   0.514200  -0.346             0.729344
## radiant_ward_sentry_count     0.905908   0.889720   1.018             0.308585
## dire_tpscroll_count           0.431231   0.556386   0.775             0.438305
## dire_boots_count             -0.103685   0.428183  -0.242             0.808662
## dire_ward_observer_count      0.443668   0.538877   0.823             0.410326
## dire_ward_sentry_count       -2.156600   1.560939  -1.382             0.167093
## radiant_gold                  5.683899   1.556447   3.652             0.000260
## radiant_xp                   -1.431557   1.296344  -1.104             0.269462
## radiant_level                -0.487327   0.937268  -0.520             0.603102
## radiant_lh                    0.342452   1.009183   0.339             0.734357
## radiant_kills                -4.284788   0.956614  -4.479           0.00000750
## radiant_deaths              -18.312482   0.733115 -24.979 < 0.0000000000000002
## dire_gold                    -6.617641   1.762258  -3.755             0.000173
## dire_xp                      -0.034236   1.338015  -0.026             0.979587
## dire_level                    0.436975   0.927391   0.471             0.637507
## dire_lh                       1.351413   1.064070   1.270             0.204070
## dire_kills                    5.272674   1.108351   4.757           0.00000196
## dire_deaths                  18.829450   0.752749  25.014 < 0.0000000000000002
## first.blood.time2            -0.228811   0.189707  -1.206             0.227768
## first.blood.time3            -0.193356   0.192778  -1.003             0.315863
## first.blood.time4            -0.269448   0.211740  -1.273             0.203181
## first.blood.time5                   NA         NA      NA                   NA
## radiant.bottle.time2         -0.028713   0.208533  -0.138             0.890485
## radiant.bottle.time3         -0.161015   0.213359  -0.755             0.450448
## radiant.bottle.time4         -0.235664   0.212732  -1.108             0.267950
## radiant.bottle.time5          0.045880   0.205230   0.224             0.823105
## radiant.courier.time2        -0.122975   0.181937  -0.676             0.499090
## radiant.courier.time3        -0.313223   0.216833  -1.445             0.148588
## radiant.courier.time4        -0.279884   0.227065  -1.233             0.217719
## radiant.courier.time5        -1.168399   0.606636  -1.926             0.054101
## radiant.fly.courier.time2     0.163821   0.251895   0.650             0.515463
## radiant.fly.courier.time3    -0.050559   0.360368  -0.140             0.888424
## radiant.fly.courier.time4    -0.036358   0.219201  -0.166             0.868261
## radiant.fly.courier.time5    -0.039463   0.209061  -0.189             0.850277
## radiant.first.ward.time2      0.426156   0.202550   2.104             0.035382
## radiant.first.ward.time3      0.124885   0.182774   0.683             0.494431
## radiant.first.ward.time4      0.031600   0.261804   0.121             0.903929
## radiant.first.ward.time5      0.100577   0.307240   0.327             0.743398
## dire.bottle.time2             0.328931   0.204196   1.611             0.107211
## dire.bottle.time3             0.247075   0.209482   1.179             0.238217
## dire.bottle.time4             0.032483   0.207279   0.157             0.875471
## dire.bottle.time5            -0.427453   0.202618  -2.110             0.034889
## dire.courier.time2            0.350168   0.187806   1.865             0.062248
## dire.courier.time3            0.419507   0.212843   1.971             0.048728
## dire.courier.time4            0.278563   0.227485   1.225             0.220751
## dire.courier.time5            0.774415   0.591901   1.308             0.190754
## dire.fly.courier.time2        0.250124   0.243268   1.028             0.303865
## dire.fly.courier.time3        0.625430   0.336807   1.857             0.063320
## dire.fly.courier.time4        0.228436   0.213497   1.070             0.284630
## dire.fly.courier.time5        0.592647   0.198576   2.984             0.002841
## dire.first.ward.time2        -0.231559   0.196908  -1.176             0.239606
## dire.first.ward.time3        -0.398541   0.190022  -2.097             0.035964
## dire.first.ward.time4        -0.110378   0.250457  -0.441             0.659425
## dire.first.ward.time5        -0.202947   0.298310  -0.680             0.496301
## duration.cnormal             -0.049248   0.174949  -0.282             0.778326
## duration.clate               -0.349623   0.180878  -1.933             0.053246
## r.carry                       0.272834   0.083140   3.282             0.001032
## r.disabler                   -0.161057   0.086884  -1.854             0.063781
## r.durable                    -0.013771   0.078831  -0.175             0.861321
## r.escape                     -0.130976   0.070386  -1.861             0.062769
## r.initiator                   0.022527   0.078665   0.286             0.774600
## r.jungler                    -0.035087   0.112016  -0.313             0.754105
## r.nuker                      -0.112139   0.080619  -1.391             0.164236
## r.pusher                      0.178244   0.079508   2.242             0.024972
## r.support                     0.191040   0.084846   2.252             0.024347
## d.carry                      -0.023684   0.084487  -0.280             0.779232
## d.disabler                    0.165348   0.080708   2.049             0.040490
## d.durable                    -0.146797   0.076943  -1.908             0.056409
## d.escape                      0.072254   0.068421   1.056             0.290958
## d.initiator                  -0.118486   0.076963  -1.540             0.123678
## d.jungler                    -0.090622   0.112680  -0.804             0.421257
## d.nuker                       0.064229   0.081308   0.790             0.429564
## d.pusher                     -0.073671   0.081768  -0.901             0.367602
## d.support                    -0.229808   0.082033  -2.801             0.005088
##                                
## (Intercept)                    
## first_blood_team1              
## first_blood_team3              
## first_blood_player11           
## first_blood_player12           
## first_blood_player13           
## first_blood_player14           
## first_blood_player15           
## first_blood_player16           
## first_blood_player17           
## first_blood_player18        .  
## first_blood_player19           
## first_blood_player110          
## first_blood_player21           
## first_blood_player22           
## first_blood_player23           
## first_blood_player24           
## first_blood_player25           
## first_blood_player26           
## first_blood_player27           
## first_blood_player28           
## first_blood_player29           
## first_blood_player210          
## radiant_tpscroll_count         
## radiant_boots_count            
## radiant_ward_observer_count    
## radiant_ward_sentry_count      
## dire_tpscroll_count            
## dire_boots_count               
## dire_ward_observer_count       
## dire_ward_sentry_count         
## radiant_gold                ***
## radiant_xp                     
## radiant_level                  
## radiant_lh                     
## radiant_kills               ***
## radiant_deaths              ***
## dire_gold                   ***
## dire_xp                        
## dire_level                     
## dire_lh                        
## dire_kills                  ***
## dire_deaths                 ***
## first.blood.time2              
## first.blood.time3              
## first.blood.time4              
## first.blood.time5              
## radiant.bottle.time2           
## radiant.bottle.time3           
## radiant.bottle.time4           
## radiant.bottle.time5           
## radiant.courier.time2          
## radiant.courier.time3          
## radiant.courier.time4          
## radiant.courier.time5       .  
## radiant.fly.courier.time2      
## radiant.fly.courier.time3      
## radiant.fly.courier.time4      
## radiant.fly.courier.time5      
## radiant.first.ward.time2    *  
## radiant.first.ward.time3       
## radiant.first.ward.time4       
## radiant.first.ward.time5       
## dire.bottle.time2              
## dire.bottle.time3              
## dire.bottle.time4              
## dire.bottle.time5           *  
## dire.courier.time2          .  
## dire.courier.time3          *  
## dire.courier.time4             
## dire.courier.time5             
## dire.fly.courier.time2         
## dire.fly.courier.time3      .  
## dire.fly.courier.time4         
## dire.fly.courier.time5      ** 
## dire.first.ward.time2          
## dire.first.ward.time3       *  
## dire.first.ward.time4          
## dire.first.ward.time5          
## duration.cnormal               
## duration.clate              .  
## r.carry                     ** 
## r.disabler                  .  
## r.durable                      
## r.escape                    .  
## r.initiator                    
## r.jungler                      
## r.nuker                        
## r.pusher                    *  
## r.support                   *  
## d.carry                        
## d.disabler                  *  
## d.durable                   .  
## d.escape                       
## d.initiator                    
## d.jungler                      
## d.nuker                        
## d.pusher                       
## d.support                   ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4436.0  on 3201  degrees of freedom
## Residual deviance: 1707.8  on 3108  degrees of freedom
## AIC: 1895.8
## 
## Number of Fisher Scoring iterations: 6

The summary above reports that 5 coefficients are not defined because of singularities. This happens when two or more predictors are perfectly collinear, meaning one column of the design matrix is an exact linear combination of others. To identify which variables are involved, we can apply the alias() function to our model.

als <- alias(glm.mod)
als$Complete

##                       (Intercept) first_blood_team1 first_blood_team3
## first_blood_player19   0           1                 0               
## first_blood_player110  0           0                 1               
## first_blood_player29   1          -1                -1               
## first_blood_player210  0           0                 1               
## first.blood.time5      0           0                 1               
##                       first_blood_player11 first_blood_player12
## first_blood_player19   0                    0                  
## first_blood_player110  0                    0                  
## first_blood_player29   0                    0                  
## first_blood_player210  0                    0                  
## first.blood.time5      0                    0                  
##                       first_blood_player13 first_blood_player14
## first_blood_player19   0                    0                  
## first_blood_player110  0                    0                  
## first_blood_player29   0                    0                  
## first_blood_player210  0                    0                  
## first.blood.time5      0                    0                  
##                       first_blood_player15 first_blood_player16
## first_blood_player19  -1                   -1                  
## first_blood_player110  0                    0                  
## first_blood_player29   0                    0                  
## first_blood_player210  0                    0                  
## first.blood.time5      0                    0                  
##                       first_blood_player17 first_blood_player18
## first_blood_player19  -1                   -1                  
## first_blood_player110  0                    0                  
## first_blood_player29   0                    0                  
## first_blood_player210  0                    0                  
## first.blood.time5      0                    0                  
##                       first_blood_player21 first_blood_player22
## first_blood_player19   0                    0                  
## first_blood_player110  0                    0                  
## first_blood_player29   0                    0                  
## first_blood_player210  0                    0                  
## first.blood.time5      0                    0                  
##                       first_blood_player23 first_blood_player24
## first_blood_player19   0                    0                  
## first_blood_player110  0                    0                  
## first_blood_player29   0                    0                  
## first_blood_player210  0                    0                  
## first.blood.time5      0                    0                  
##                       first_blood_player25 first_blood_player26
## first_blood_player19   0                    0                  
## first_blood_player110  0                    0                  
## first_blood_player29  -1                   -1                  
## first_blood_player210  0                    0                  
## first.blood.time5      0                    0                  
##                       first_blood_player27 first_blood_player28
## first_blood_player19   0                    0                  
## first_blood_player110  0                    0                  
## first_blood_player29  -1                   -1                  
## first_blood_player210  0                    0                  
## first.blood.time5      0                    0                  
##                       radiant_tpscroll_count radiant_boots_count
## first_blood_player19   0                      0                 
## first_blood_player110  0                      0                 
## first_blood_player29   0                      0                 
## first_blood_player210  0                      0                 
## first.blood.time5      0                      0                 
##                       radiant_ward_observer_count radiant_ward_sentry_count
## first_blood_player19   0                           0                       
## first_blood_player110  0                           0                       
## first_blood_player29   0                           0                       
## first_blood_player210  0                           0                       
## first.blood.time5      0                           0                       
##                       dire_tpscroll_count dire_boots_count
## first_blood_player19   0                   0              
## first_blood_player110  0                   0              
## first_blood_player29   0                   0              
## first_blood_player210  0                   0              
## first.blood.time5      0                   0              
##                       dire_ward_observer_count dire_ward_sentry_count
## first_blood_player19   0                        0                    
## first_blood_player110  0                        0                    
## first_blood_player29   0                        0                    
## first_blood_player210  0                        0                    
## first.blood.time5      0                        0                    
##                       radiant_gold radiant_xp radiant_level radiant_lh
## first_blood_player19   0            0          0             0        
## first_blood_player110  0            0          0             0        
## first_blood_player29   0            0          0             0        
## first_blood_player210  0            0          0             0        
## first.blood.time5      0            0          0             0        
##                       radiant_kills radiant_deaths dire_gold dire_xp dire_level
## first_blood_player19   0             0              0         0       0        
## first_blood_player110  0             0              0         0       0        
## first_blood_player29   0             0              0         0       0        
## first_blood_player210  0             0              0         0       0        
## first.blood.time5      0             0              0         0       0        
##                       dire_lh dire_kills dire_deaths first.blood.time2
## first_blood_player19   0       0          0           0               
## first_blood_player110  0       0          0           0               
## first_blood_player29   0       0          0           0               
## first_blood_player210  0       0          0           0               
## first.blood.time5      0       0          0           0               
##                       first.blood.time3 first.blood.time4 radiant.bottle.time2
## first_blood_player19   0                 0                 0                  
## first_blood_player110  0                 0                 0                  
## first_blood_player29   0                 0                 0                  
## first_blood_player210  0                 0                 0                  
## first.blood.time5      0                 0                 0                  
##                       radiant.bottle.time3 radiant.bottle.time4
## first_blood_player19   0                    0                  
## first_blood_player110  0                    0                  
## first_blood_player29   0                    0                  
## first_blood_player210  0                    0                  
## first.blood.time5      0                    0                  
##                       radiant.bottle.time5 radiant.courier.time2
## first_blood_player19   0                    0                   
## first_blood_player110  0                    0                   
## first_blood_player29   0                    0                   
## first_blood_player210  0                    0                   
## first.blood.time5      0                    0                   
##                       radiant.courier.time3 radiant.courier.time4
## first_blood_player19   0                     0                   
## first_blood_player110  0                     0                   
## first_blood_player29   0                     0                   
## first_blood_player210  0                     0                   
## first.blood.time5      0                     0                   
##                       radiant.courier.time5 radiant.fly.courier.time2
## first_blood_player19   0                     0                       
## first_blood_player110  0                     0                       
## first_blood_player29   0                     0                       
## first_blood_player210  0                     0                       
## first.blood.time5      0                     0                       
##                       radiant.fly.courier.time3 radiant.fly.courier.time4
## first_blood_player19   0                         0                       
## first_blood_player110  0                         0                       
## first_blood_player29   0                         0                       
## first_blood_player210  0                         0                       
## first.blood.time5      0                         0                       
##                       radiant.fly.courier.time5 radiant.first.ward.time2
## first_blood_player19   0                         0                      
## first_blood_player110  0                         0                      
## first_blood_player29   0                         0                      
## first_blood_player210  0                         0                      
## first.blood.time5      0                         0                      
##                       radiant.first.ward.time3 radiant.first.ward.time4
## first_blood_player19   0                        0                      
## first_blood_player110  0                        0                      
## first_blood_player29   0                        0                      
## first_blood_player210  0                        0                      
## first.blood.time5      0                        0                      
##                       radiant.first.ward.time5 dire.bottle.time2
## first_blood_player19   0                        0               
## first_blood_player110  0                        0               
## first_blood_player29   0                        0               
## first_blood_player210  0                        0               
## first.blood.time5      0                        0               
##                       dire.bottle.time3 dire.bottle.time4 dire.bottle.time5
## first_blood_player19   0                 0                 0               
## first_blood_player110  0                 0                 0               
## first_blood_player29   0                 0                 0               
## first_blood_player210  0                 0                 0               
## first.blood.time5      0                 0                 0               
##                       dire.courier.time2 dire.courier.time3 dire.courier.time4
## first_blood_player19   0                  0                  0                
## first_blood_player110  0                  0                  0                
## first_blood_player29   0                  0                  0                
## first_blood_player210  0                  0                  0                
## first.blood.time5      0                  0                  0                
##                       dire.courier.time5 dire.fly.courier.time2
## first_blood_player19   0                  0                    
## first_blood_player110  0                  0                    
## first_blood_player29   0                  0                    
## first_blood_player210  0                  0                    
## first.blood.time5      0                  0                    
##                       dire.fly.courier.time3 dire.fly.courier.time4
## first_blood_player19   0                      0                    
## first_blood_player110  0                      0                    
## first_blood_player29   0                      0                    
## first_blood_player210  0                      0                    
## first.blood.time5      0                      0                    
##                       dire.fly.courier.time5 dire.first.ward.time2
## first_blood_player19   0                      0                   
## first_blood_player110  0                      0                   
## first_blood_player29   0                      0                   
## first_blood_player210  0                      0                   
## first.blood.time5      0                      0                   
##                       dire.first.ward.time3 dire.first.ward.time4
## first_blood_player19   0                     0                   
## first_blood_player110  0                     0                   
## first_blood_player29   0                     0                   
## first_blood_player210  0                     0                   
## first.blood.time5      0                     0                   
##                       dire.first.ward.time5 duration.cnormal duration.clate
## first_blood_player19   0                     0                0            
## first_blood_player110  0                     0                0            
## first_blood_player29   0                     0                0            
## first_blood_player210  0                     0                0            
## first.blood.time5      0                     0                0            
##                       r.carry r.disabler r.durable r.escape r.initiator
## first_blood_player19   0       0          0         0        0         
## first_blood_player110  0       0          0         0        0         
## first_blood_player29   0       0          0         0        0         
## first_blood_player210  0       0          0         0        0         
## first.blood.time5      0       0          0         0        0         
##                       r.jungler r.nuker r.pusher r.support d.carry d.disabler
## first_blood_player19   0         0       0        0         0       0        
## first_blood_player110  0         0       0        0         0       0        
## first_blood_player29   0         0       0        0         0       0        
## first_blood_player210  0         0       0        0         0       0        
## first.blood.time5      0         0       0        0         0       0        
##                       d.durable d.escape d.initiator d.jungler d.nuker d.pusher
## first_blood_player19   0         0        0           0         0       0      
## first_blood_player110  0         0        0           0         0       0      
## first_blood_player29   0         0        0           0         0       0      
## first_blood_player210  0         0        0           0         0       0      
## first.blood.time5      0         0        0           0         0       0      
##                       d.support
## first_blood_player19   0       
## first_blood_player110  0       
## first_blood_player29   0       
## first_blood_player210  0       
## first.blood.time5      0
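
The alias matrix is mostly zeros, so it can be easier to list only the non-zero cells. Here is a minimal sketch on toy data (not the real model): a hypothetical player factor fully determines a team factor, which reproduces the same kind of singularity. The names toy, toy.mod, toy.als and dep.pairs are made up for this example.

```r
# Toy data (hypothetical): ten player slots, slots 0-4 on one team, 5-9 on the other
set.seed(42)
player <- factor(rep(0:9, each = 50))
team   <- factor(ifelse(as.integer(as.character(player)) < 5, "radiant", "dire"),
                 levels = c("radiant", "dire"))
toy <- data.frame(y = rbinom(500, 1, 0.5), team, player)

# team is fully determined by player, so one dummy column is redundant
toy.mod <- glm(y ~ team + player, family = "binomial", data = toy)

# coefficients that glm() could not estimate come back as NA
na.coef <- names(coef(toy.mod))[is.na(coef(toy.mod))]
na.coef

# keep only the non-zero cells of the alias matrix
toy.als <- unclass(alias(toy.mod)$Complete)
dep <- which(abs(toy.als) > 1e-6, arr.ind = TRUE)
dep.pairs <- data.frame(
  dropped = rownames(toy.als)[dep[, "row"]],
  kept    = colnames(toy.als)[dep[, "col"]],
  weight  = round(toy.als[dep], 6)
)
dep.pairs
```

Each row reads as "the dropped term contains the kept term with this weight", so a dropped dummy is the weighted sum of its rows. The same two steps should work on glm.mod in place of toy.mod.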

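The alias matrix for glm.mod shows that each dropped term is an exact linear combination of other terms, marked by the 1/-1 entries. first_blood_player19 equals first_blood_team1 minus first_blood_player15 through first_blood_player18. first_blood_player29 equals the intercept minus first_blood_team1, first_blood_team3 and first_blood_player25 through first_blood_player28. first_blood_player110, first_blood_player210 and first.blood.time5 each coincide with first_blood_team3. This is perfect multicollinearity rather than just high correlation, so we should remove the first_blood_player1, first_blood_player2 and first.blood.time variables and re-build the model.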

trainer1 <- trainer1 %>% select(-c("first_blood_player1","first_blood_player2","first.blood.time"))
tester1 <- tester1 %>% select(-c("first_blood_player1","first_blood_player2","first.blood.time"))

# re-build the model
glm.modx <- glm(radiant_win~., family = "binomial", data=trainer1)
summary(glm.modx)

## 
## Call:
## glm(formula = radiant_win ~ ., family = "binomial", data = trainer1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1239  -0.3339   0.0384   0.3360   3.5139  
## 
## Coefficients:
##                               Estimate Std. Error z value             Pr(>|z|)
## (Intercept)                   0.573158   1.422340   0.403             0.686972
## first_blood_team1             0.250230   0.160030   1.564             0.117901
## first_blood_team3            -0.148207   0.280242  -0.529             0.596906
## radiant_tpscroll_count       -0.555844   0.602067  -0.923             0.355890
## radiant_boots_count          -0.291132   0.426301  -0.683             0.494654
## radiant_ward_observer_count  -0.199725   0.507097  -0.394             0.693685
## radiant_ward_sentry_count     0.999984   0.883201   1.132             0.257539
## dire_tpscroll_count           0.381178   0.551612   0.691             0.489549
## dire_boots_count             -0.079117   0.423894  -0.187             0.851940
## dire_ward_observer_count      0.389431   0.532573   0.731             0.464641
## dire_ward_sentry_count       -2.292046   1.554027  -1.475             0.140237
## radiant_gold                  5.406592   1.540102   3.511             0.000447
## radiant_xp                   -1.527721   1.272678  -1.200             0.229984
## radiant_level                -0.357128   0.931393  -0.383             0.701398
## radiant_lh                    0.493833   0.993591   0.497             0.619176
## radiant_kills                -3.957127   0.931430  -4.248          0.000021526
## radiant_deaths              -18.206548   0.725634 -25.091 < 0.0000000000000002
## dire_gold                    -6.553445   1.747797  -3.750             0.000177
## dire_xp                      -0.009272   1.320130  -0.007             0.994396
## dire_level                    0.308288   0.919844   0.335             0.737510
## dire_lh                       1.342968   1.053349   1.275             0.202327
## dire_kills                    5.575367   1.075761   5.183          0.000000219
## dire_deaths                  18.714164   0.743607  25.167 < 0.0000000000000002
## radiant.bottle.time2         -0.030721   0.206415  -0.149             0.881685
## radiant.bottle.time3         -0.154340   0.210646  -0.733             0.463743
## radiant.bottle.time4         -0.231778   0.210194  -1.103             0.270164
## radiant.bottle.time5          0.034875   0.202606   0.172             0.863335
## radiant.courier.time2        -0.137913   0.179984  -0.766             0.443525
## radiant.courier.time3        -0.325337   0.214871  -1.514             0.130000
## radiant.courier.time4        -0.281321   0.224270  -1.254             0.209702
## radiant.courier.time5        -1.216532   0.605217  -2.010             0.044423
## radiant.fly.courier.time2     0.155010   0.249647   0.621             0.534655
## radiant.fly.courier.time3    -0.080350   0.358810  -0.224             0.822808
## radiant.fly.courier.time4    -0.044187   0.217117  -0.204             0.838733
## radiant.fly.courier.time5    -0.065200   0.205968  -0.317             0.751582
## radiant.first.ward.time2      0.447264   0.200334   2.233             0.025576
## radiant.first.ward.time3      0.160904   0.180941   0.889             0.373862
## radiant.first.ward.time4      0.076372   0.259180   0.295             0.768249
## radiant.first.ward.time5      0.118952   0.303787   0.392             0.695381
## dire.bottle.time2             0.335997   0.202401   1.660             0.096902
## dire.bottle.time3             0.238087   0.207832   1.146             0.251972
## dire.bottle.time4             0.052673   0.205312   0.257             0.797524
## dire.bottle.time5            -0.370267   0.200481  -1.847             0.064762
## dire.courier.time2            0.344309   0.185793   1.853             0.063856
## dire.courier.time3            0.399966   0.210840   1.897             0.057826
## dire.courier.time4            0.256241   0.224465   1.142             0.253637
## dire.courier.time5            0.680757   0.587337   1.159             0.246433
## dire.fly.courier.time2        0.219441   0.241159   0.910             0.362853
## dire.fly.courier.time3        0.602158   0.331677   1.815             0.069448
## dire.fly.courier.time4        0.211294   0.210944   1.002             0.316508
## dire.fly.courier.time5        0.582036   0.196383   2.964             0.003039
## dire.first.ward.time2        -0.233495   0.195112  -1.197             0.231414
## dire.first.ward.time3        -0.367372   0.187683  -1.957             0.050299
## dire.first.ward.time4        -0.073471   0.247524  -0.297             0.766600
## dire.first.ward.time5        -0.200589   0.295729  -0.678             0.497589
## duration.cnormal             -0.050582   0.173247  -0.292             0.770315
## duration.clate               -0.330714   0.179374  -1.844             0.065225
## r.carry                       0.277567   0.082062   3.382             0.000718
## r.disabler                   -0.156464   0.086069  -1.818             0.069082
## r.durable                    -0.024560   0.078114  -0.314             0.753205
## r.escape                     -0.129086   0.069771  -1.850             0.064294
## r.initiator                   0.016762   0.077776   0.216             0.829366
## r.jungler                    -0.033309   0.111380  -0.299             0.764897
## r.nuker                      -0.125011   0.079819  -1.566             0.117306
## r.pusher                      0.168854   0.078664   2.147             0.031831
## r.support                     0.195588   0.083758   2.335             0.019535
## d.carry                      -0.023203   0.083594  -0.278             0.781342
## d.disabler                    0.159521   0.080342   1.986             0.047085
## d.durable                    -0.147526   0.076180  -1.937             0.052800
## d.escape                      0.067668   0.067785   0.998             0.318150
## d.initiator                  -0.116097   0.076120  -1.525             0.127214
## d.jungler                    -0.080224   0.111269  -0.721             0.470915
## d.nuker                       0.054818   0.080375   0.682             0.495227
## d.pusher                     -0.085410   0.081397  -1.049             0.294040
## d.support                    -0.212370   0.081087  -2.619             0.008818
##                                
## (Intercept)                    
## first_blood_team1              
## first_blood_team3              
## radiant_tpscroll_count         
## radiant_boots_count            
## radiant_ward_observer_count    
## radiant_ward_sentry_count      
## dire_tpscroll_count            
## dire_boots_count               
## dire_ward_observer_count       
## dire_ward_sentry_count         
## radiant_gold                ***
## radiant_xp                     
## radiant_level                  
## radiant_lh                     
## radiant_kills               ***
## radiant_deaths              ***
## dire_gold                   ***
## dire_xp                        
## dire_level                     
## dire_lh                        
## dire_kills                  ***
## dire_deaths                 ***
## radiant.bottle.time2           
## radiant.bottle.time3           
## radiant.bottle.time4           
## radiant.bottle.time5           
## radiant.courier.time2          
## radiant.courier.time3          
## radiant.courier.time4          
## radiant.courier.time5       *  
## radiant.fly.courier.time2      
## radiant.fly.courier.time3      
## radiant.fly.courier.time4      
## radiant.fly.courier.time5      
## radiant.first.ward.time2    *  
## radiant.first.ward.time3       
## radiant.first.ward.time4       
## radiant.first.ward.time5       
## dire.bottle.time2           .  
## dire.bottle.time3              
## dire.bottle.time4              
## dire.bottle.time5           .  
## dire.courier.time2          .  
## dire.courier.time3          .  
## dire.courier.time4             
## dire.courier.time5             
## dire.fly.courier.time2         
## dire.fly.courier.time3      .  
## dire.fly.courier.time4         
## dire.fly.courier.time5      ** 
## dire.first.ward.time2          
## dire.first.ward.time3       .  
## dire.first.ward.time4          
## dire.first.ward.time5          
## duration.cnormal               
## duration.clate              .  
## r.carry                     ***
## r.disabler                  .  
## r.durable                      
## r.escape                    .  
## r.initiator                    
## r.jungler                      
## r.nuker                        
## r.pusher                    *  
## r.support                   *  
## d.carry                        
## d.disabler                  *  
## d.durable                   .  
## d.escape                       
## d.initiator                    
## d.jungler                      
## d.nuker                        
## d.pusher                       
## d.support                   ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4436.0  on 3201  degrees of freedom
## Residual deviance: 1721.8  on 3127  degrees of freedom
## AIC: 1871.8
## 
## Number of Fisher Scoring iterations: 6

The model doesn’t have any problems this time. Let’s move on to prediction.

# predict on the test data
tester1$predicted <- predict(glm.modx, type = "response", newdata = tester1)

# use the default threshold of 0.5
tester1$pred.lab <- ifelse(tester1$predicted > 0.5, "True", "False")

# data.frame() keeps the probabilities numeric (cbind() would coerce them to character);
# the labels are stored as a factor because confusionMatrix() expects factors
predicted.df <- data.frame(pred.glm1 = tester1$predicted,
                           label.glm1 = factor(tester1$pred.lab, levels = c("False", "True")))

# drop the two new columns from tester
tester1 <- tester1 %>% select(-predicted, -pred.lab)

# confusion matrix
glm.ev1 <- confusionMatrix(predicted.df$label.glm1, reference = tester1$radiant_win,
                positive = "True")
glm.ev1

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction False True
##      False   350   43
##      True     37  368
##                                              
##                Accuracy : 0.8997             
##                  95% CI : (0.8768, 0.9197)   
##     No Information Rate : 0.515              
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.7994             
##                                              
##  Mcnemar's Test P-Value : 0.5762             
##                                              
##             Sensitivity : 0.8954             
##             Specificity : 0.9044             
##          Pos Pred Value : 0.9086             
##          Neg Pred Value : 0.8906             
##              Prevalence : 0.5150             
##          Detection Rate : 0.4612             
##    Detection Prevalence : 0.5075             
##       Balanced Accuracy : 0.8999             
##                                              
##        'Positive' Class : True               
##
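
To make the statistics above less of a black box, here is a small sketch that recomputes a few of them by hand from the four counts in the confusion matrix. Only base R is used; the object name `cm` is mine, not part of the original analysis.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(c(350, 37, 43, 368), nrow = 2,
             dimnames = list(Prediction = c("False", "True"),
                             Reference  = c("False", "True")))

tp <- cm["True", "True"]    # predicted Radiant win, Radiant won
tn <- cm["False", "False"]  # predicted Radiant loss, Radiant lost
fp <- cm["True", "False"]   # predicted Radiant win, Radiant lost
fn <- cm["False", "True"]   # predicted Radiant loss, Radiant won

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
```

The rounded values should match the Accuracy, Sensitivity, Specificity, and Balanced Accuracy rows printed by confusionMatrix().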

KNN

First, we need to separate the predictors from the label in both the train and test data.

trainer1.x <- trainer1[,-10]
tester1.x <- tester1[,-c(10)]

trainer1.y <- trainer1[,10]
tester1.y <- tester1[,10]

Note: the KNN model only accepts numeric data. We could convert every factor into a set of 0/1 columns (these are called dummy variables), but I think that’s a lot of work and we’d end up with many more variables. So I’ll drop all non-numeric variables instead.

trainer1.num <- select_if(trainer1.x, is.numeric)
tester1.num <- select_if(tester1.x, is.numeric)

I was taught a common rule of thumb for choosing K in a KNN model: take the square root of the number of observations in the training set.

# finding K optimum
kk <- sqrt(nrow(trainer1))
kk <- round(kk)
kk

## [1] 57

# Knn modeling
knn.mod <- class::knn(train = trainer1.num, test = tester1.num,
               cl = trainer1[,10], k = kk)

# confusion matrix
knn.ev1 <- confusionMatrix(knn.mod, reference = tester1.y,
                           positive = "True")
knn.ev1

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction False True
##      False   163  114
##      True    224  297
##                                          
##                Accuracy : 0.5764         
##                  95% CI : (0.5413, 0.611)
##     No Information Rate : 0.515          
##     P-Value [Acc > NIR] : 0.0002876      
##                                          
##                   Kappa : 0.145          
##                                          
##  Mcnemar's Test P-Value : 0.000000003051 
##                                          
##             Sensitivity : 0.7226         
##             Specificity : 0.4212         
##          Pos Pred Value : 0.5701         
##          Neg Pred Value : 0.5884         
##              Prevalence : 0.5150         
##          Detection Rate : 0.3722         
##    Detection Prevalence : 0.6529         
##       Balanced Accuracy : 0.5719         
##                                          
##        'Positive' Class : True           
##

As we can see, the glm model is far better than knn (accuracy of about 90.0% versus 57.6%). However, we’ll try to improve both models to achieve higher accuracy (or other metrics).

Model improvement

Bane image by: chroneco

Logistic Regression with Stepwise

One way to improve a glm model is stepwise selection. I’ll use the backward direction to drop the least contributive predictors. The step() function starts from the model with every available variable, then drops variables one at a time for as long as doing so lowers the AIC. It does not test for multicollinearity directly, but redundant predictors that are strongly correlated with others tend to be dropped along the way. AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher its quality.
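
Before running it on the real model, here is a minimal sketch of backward elimination on synthetic data. Everything in it (`toy`, `x1`, `x2`, `x3`, and the true effect of 1.5 on `x1`) is an assumption made up for the demo: only `x1` actually drives the outcome, so `x2` and `x3` are candidates for removal.

```r
# Synthetic data: y depends on x1 only; x2 and x3 are pure noise
set.seed(2020)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1))

# Full model with every predictor, then backward elimination by AIC
full    <- glm(y ~ x1 + x2 + x3, family = "binomial", data = toy)
reduced <- step(full, direction = "backward", trace = 0)

# The reduced model never has a higher AIC than the starting model
c(full = AIC(full), reduced = AIC(reduced))
formula(reduced)
```

Note that step() only removes a term when that lowers the AIC, so a noise variable can survive by chance; a lower AIC does not by itself guarantee better test accuracy.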

# model new glm
glm.mod.2 <- step(glm.modx, direction = "backward", trace = 0)

summary(glm.mod.2)

## 
## Call:
## glm(formula = radiant_win ~ radiant_ward_sentry_count + radiant_gold + 
##     radiant_kills + radiant_deaths + dire_gold + dire_kills + 
##     dire_deaths + dire.bottle.time + dire.fly.courier.time + 
##     duration.c + r.carry + r.disabler + r.escape + r.nuker + 
##     r.pusher + r.support + d.disabler + d.durable + d.initiator + 
##     d.support, family = "binomial", data = trainer1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0618  -0.3464   0.0413   0.3452   3.7217  
## 
## Coefficients:
##                            Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)                -0.87515    0.83830  -1.044             0.296505    
## radiant_ward_sentry_count   1.30272    0.82584   1.577             0.114695    
## radiant_gold                5.52261    0.84924   6.503    0.000000000078719 ***
## radiant_kills              -4.39578    0.60841  -7.225    0.000000000000501 ***
## radiant_deaths            -17.60185    0.69013 -25.505 < 0.0000000000000002 ***
## dire_gold                  -4.48986    0.93544  -4.800    0.000001588779142 ***
## dire_kills                  4.54758    0.66986   6.789    0.000000000011304 ***
## dire_deaths                18.25000    0.70511  25.883 < 0.0000000000000002 ***
## dire.bottle.time2           0.34384    0.19551   1.759             0.078628 .  
## dire.bottle.time3           0.23272    0.19971   1.165             0.243893    
## dire.bottle.time4           0.02020    0.19747   0.102             0.918505    
## dire.bottle.time5          -0.38890    0.19213  -2.024             0.042950 *  
## dire.fly.courier.time2      0.25997    0.23608   1.101             0.270816    
## dire.fly.courier.time3      0.54348    0.32344   1.680             0.092894 .  
## dire.fly.courier.time4      0.20935    0.20396   1.026             0.304697    
## dire.fly.courier.time5      0.55188    0.18313   3.014             0.002582 ** 
## duration.cnormal           -0.04879    0.16839  -0.290             0.772018    
## duration.clate             -0.36884    0.17128  -2.153             0.031282 *  
## r.carry                     0.25744    0.07534   3.417             0.000633 ***
## r.disabler                 -0.15092    0.07364  -2.049             0.040428 *  
## r.escape                   -0.12639    0.06265  -2.017             0.043655 *  
## r.nuker                    -0.10344    0.06915  -1.496             0.134698    
## r.pusher                    0.16918    0.07158   2.364             0.018095 *  
## r.support                   0.23258    0.07569   3.073             0.002120 ** 
## d.disabler                  0.17478    0.07652   2.284             0.022363 *  
## d.durable                  -0.19007    0.06640  -2.863             0.004202 ** 
## d.initiator                -0.11190    0.07173  -1.560             0.118729    
## d.support                  -0.20817    0.07309  -2.848             0.004396 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4436.0  on 3201  degrees of freedom
## Residual deviance: 1761.1  on 3174  degrees of freedom
## AIC: 1817.1
## 
## Number of Fisher Scoring iterations: 6

From the summary above, we can conclude that gold, kills, and deaths for both teams are the most significant variables in our model (they have the lowest p-values). Among the hero roles, carry is the most significant, followed by support and durable. duration.clate has a negative, significant coefficient for a Radiant win, which means that longer games tend to end with the Radiant team losing.

Dota 2 players out there might find this model summary very relatable. From experience, we know that carry and support have very different tasks in the game and that both are super important for winning. Tanky heroes (durable) and disablers are also important. The timing of the bottle and the flying courier in the first 5 minutes is also significant for winning. Having different roles in the team and taking care of detailed game mechanics like the bottle and courier are needed to win, says our experience, and the data backs it up.

Multicollinearity check

We use the Variance Inflation Factor (VIF) to check for multicollinearity among the variables in our model. A common rule of thumb is that a VIF greater than 10 may indicate high collinearity and is worth further inspection.

library(car)
vif(glm.mod.2)

##                               GVIF Df GVIF^(1/(2*Df))
## radiant_ward_sentry_count 1.056419  1        1.027823
## radiant_gold              1.956312  1        1.398682
## radiant_kills             2.051588  1        1.432336
## radiant_deaths            2.922481  1        1.709526
## dire_gold                 2.135187  1        1.461228
## dire_kills                2.236793  1        1.495591
## dire_deaths               3.038014  1        1.742990
## dire.bottle.time          1.181329  4        1.021048
## dire.fly.courier.time     1.137034  4        1.016182
## duration.c                1.502324  2        1.107110
## r.carry                   1.197186  1        1.094160
## r.disabler                1.164305  1        1.079030
## r.escape                  1.094758  1        1.046307
## r.nuker                   1.096136  1        1.046965
## r.pusher                  1.072751  1        1.035737
## r.support                 1.236724  1        1.112081
## d.disabler                1.409389  1        1.187177
## d.durable                 1.175789  1        1.084338
## d.initiator               1.393327  1        1.180393
## d.support                 1.193396  1        1.092426

All values are well below 10, so it looks like our model doesn’t have a multicollinearity problem.
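
For intuition about what a VIF measures, here is a small sketch that computes it by hand for a plain linear model, using the built-in mtcars data instead of the Dota data. For a predictor j, VIF is 1 / (1 - R²), where R² comes from regressing j on the other predictors. This is the textbook definition for linear models with single-coefficient terms; I am assuming it is close enough for illustration, while car::vif() on a glm with factor terms reports the generalized version (GVIF) shown above.

```r
# Predictors to check (chosen only for the demo)
preds <- c("wt", "hp", "disp")

# For each predictor: regress it on the others and turn the R-squared into a VIF
vif_manual <- sapply(preds, function(p) {
  others <- setdiff(preds, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

A VIF of 1 means the predictor is unrelated to the others; the more of its variance the other predictors explain, the larger the value gets.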

# predict to test data
predicted.df$pred.glm2 <- predict(glm.mod.2, type = "response", newdata = tester1)
# still use the 0.5 threshold
predicted.df$label.glm2 <- ifelse(predicted.df$pred.glm2 > 0.5, "True", "False")

# confusion matrix
glm.ev2 <- confusionMatrix(as.factor(predicted.df$label.glm2), reference = tester1$radiant_win,
                positive = "True")
glm.ev2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction False True
##      False   347   47
##      True     40  364
##                                              
##                Accuracy : 0.891              
##                  95% CI : (0.8673, 0.9118)   
##     No Information Rate : 0.515              
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.7819             
##                                              
##  Mcnemar's Test P-Value : 0.5201             
##                                              
##             Sensitivity : 0.8856             
##             Specificity : 0.8966             
##          Pos Pred Value : 0.9010             
##          Neg Pred Value : 0.8807             
##              Prevalence : 0.5150             
##          Detection Rate : 0.4561             
##    Detection Prevalence : 0.5063             
##       Balanced Accuracy : 0.8911             
##                                              
##        'Positive' Class : True               
##

KNN - add the factor predictors and change K

I’m going to go a bit crazy with this one. I decided to convert all the factor columns into numeric ones with dummy_cols. I also use remove_first_dummy to avoid multicollinearity.
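
To show what dropping the first dummy looks like, here is a minimal base R sketch using model.matrix() rather than fastDummies, so it runs without extra packages. The toy factor and its level name "early" are hypothetical; only "normal" and "late" appear in the model summaries above.

```r
# Toy factor with three levels; "early" is a made-up name for the baseline level
toy <- data.frame(
  duration.c = factor(c("early", "normal", "late", "normal"),
                      levels = c("early", "normal", "late"))
)

# model.matrix() builds an intercept plus one 0/1 column per non-baseline level;
# dropping the intercept column leaves k - 1 dummies for a factor with k levels
dummies <- model.matrix(~ duration.c, data = toy)[, -1, drop = FALSE]
dummies
```

The baseline level is encoded as all zeros, which is why one column per factor can be dropped without losing information.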

# select only the factor variables to build their dummies
trainer1.x.fac <- select_if(trainer1.x, is.factor)
tester1.x.fac <- select_if(tester1.x, is.factor)
# build dummy variable
dum.v.train <- dummy_cols(trainer1.x.fac, remove_first_dummy = TRUE)
dum.v.train <- dum.v.train[,11:46]
dum.v.test <- dummy_cols(tester1.x.fac, remove_first_dummy = TRUE)
dum.v.test <- dum.v.test[,11:46]

# bind with numeric df
trainer2.x <- cbind(trainer1.num, dum.v.train)
tester2.x <- cbind(tester1.num, dum.v.test)

# model the new knn
knn.mod.2 <- class::knn(train = trainer2.x, test = tester2.x,
               cl = trainer1[,10], k = kk)
knn.ev2 <- confusionMatrix(knn.mod.2, reference = tester1.y,
                           positive = "True")
knn.ev2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction False True
##      False   148  103
##      True    239  308
##                                             
##                Accuracy : 0.5714            
##                  95% CI : (0.5363, 0.6061)  
##     No Information Rate : 0.515             
##     P-Value [Acc > NIR] : 0.0007939         
##                                             
##                   Kappa : 0.1332            
##                                             
##  Mcnemar's Test P-Value : 0.0000000000002878
##                                             
##             Sensitivity : 0.7494            
##             Specificity : 0.3824            
##          Pos Pred Value : 0.5631            
##          Neg Pred Value : 0.5896            
##              Prevalence : 0.5150            
##          Detection Rate : 0.3860            
##    Detection Prevalence : 0.6855            
##       Balanced Accuracy : 0.5659            
##                                             
##        'Positive' Class : True              
##

We can see that adding the extra predictors doesn’t help much. Next, we’ll change K to look for the best accuracy. I built a loop that trains knn for every K from 40 to 70, builds the confusion matrix, and stores the accuracy. Thanks to this article for the idea.

# build the loop from k=40 until k=70 by 1
k_values <- seq(40,70,1)

num_k <- length(k_values)
# make empty table to save accuracy rates 
acc.k.df <- tibble(k = rep(0, num_k), acc = rep(0, num_k))

# evaluate knn for a bunch of values of k
for(i in 1:num_k){
  
  k <- k_values[i]
  # build the knn model for the current k (from 40 to 70)
  k.finder <- class::knn(train = trainer2.x, test = tester2.x,
               cl = trainer1[,10], k = k)
  # build the confusion matrix (yes, a knn model and its confusion matrix are built in every
  # iteration; this is computationally expensive, so don't try it with many more observations)
  k.conf <- confusionMatrix(k.finder, reference = tester1.y,
                           positive = "True")
  # store the k value in the table
  acc.k.df[i, 'k'] <- k
  # store acc values from confusion matrix
  acc.k.df[i, 'acc'] <- k.conf$overall[[1]]
  
}
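
The same kind of search can be written more compactly. The sketch below is only an illustration on the built-in iris data (not the Dota data), with a hypothetical split of 100 training rows and K from 1 to 15. It computes accuracy directly as the share of matching labels, which avoids building a full confusion matrix in every iteration.

```r
library(class)

# Hypothetical train/test split on iris, just for the demo
set.seed(123)
idx     <- sample(nrow(iris), 100)
train_x <- iris[idx, 1:4]
test_x  <- iris[-idx, 1:4]
train_y <- iris$Species[idx]
test_y  <- iris$Species[-idx]

# Evaluate knn for each k and keep only the accuracy
k_values <- 1:15
acc <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

acc_df <- data.frame(k = k_values, acc = acc)
best_k <- acc_df$k[which.max(acc_df$acc)]
best_k
```

Keep in mind that picking K by test-set accuracy, here and in the loop above, makes the reported accuracy a bit optimistic; a separate validation split would be the more careful choice.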

Let’s draw a plot from the K loop to make the interpretation a bit fancier.

acc.k.df %>% ggplot(aes(x = k, y= acc)) +
  geom_point() + geom_line() + theme_bw() +
  scale_x_continuous(breaks = seq(40,70,2))

Extra note: due to the randomness of the KNN algorithm, the plot you see above might be different from what I saw when I ran the model (and when I knit the Rmd). The results are different every time I run the models. So here is a picture of my plot from when I wrote this article; from here on, the analysis is based on the plot I uploaded below.
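
One way to make such runs repeatable is to fix the random seed right before each call, since class::knn() breaks ties at random. Here is a minimal sketch on the built-in iris data; the split and the seed value 42 are arbitrary choices for the demo.

```r
library(class)

# Fixed, non-random split so that only knn()'s tie-breaking is random
train_idx <- seq(1, nrow(iris), by = 2)
train_x <- iris[train_idx, 1:4]
test_x  <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]

# Same seed before each call gives the same predictions
set.seed(42)
run1 <- knn(train = train_x, test = test_x, cl = train_y, k = 4)
set.seed(42)
run2 <- knn(train = train_x, test = test_x, cl = train_y, k = 4)

identical(run1, run2)
```

In a loop over K, the seed would have to be set inside the loop (or once before it, if the whole loop is always rerun from the start).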

From the plot we can see that K = 60 has the highest accuracy, so let’s rebuild our knn with it.

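knn.mod.3 <- class::knn(train = trainer2.x, test = tester2.x,
               cl = trainer1[,10], k = 60)
# evaluate the k = 60 model (knn.mod.3); note that the output printed below was
# generated with knn.mod.2 passed here by mistake, so it repeats the knn.ev2 result
knn.ev3 <- confusionMatrix(knn.mod.3, reference = tester1.y,
                           positive = "True")
knn.ev3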

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction False True
##      False   148  103
##      True    239  308
##                                             
##                Accuracy : 0.5714            
##                  95% CI : (0.5363, 0.6061)  
##     No Information Rate : 0.515             
##     P-Value [Acc > NIR] : 0.0007939         
##                                             
##                   Kappa : 0.1332            
##                                             
##  Mcnemar's Test P-Value : 0.0000000000002878
##                                             
##             Sensitivity : 0.7494            
##             Specificity : 0.3824            
##          Pos Pred Value : 0.5631            
##          Neg Pred Value : 0.5896            
##              Prevalence : 0.5150            
##          Detection Rate : 0.3860            
##    Detection Prevalence : 0.6855            
##       Balanced Accuracy : 0.5659            
##                                             
##        'Positive' Class : True              
##

Oh no, it actually isn’t, hahaha. (Note that the matrix above is identical to the knn.ev2 output because it was computed from knn.mod.2; the k = 60 predictions are the ones stored in knn.mod.3.) It looks like changing K doesn’t really solve our problem. The KNN algorithm uses Euclidean distance to find the nearest neighbours. When we convert a categorical variable to numeric, each dummy column can only take the values 1 and 0, so the distances are easily biased because observations are not well separated along those variables. So we can say that KNN is not suitable for our case: our data has a lot of categorical variables, and KNN does not handle categorical variables well.
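
Here is a tiny sketch of that distance problem. The two observations and their values are made up, and I am assuming the numeric features are scaled to the 0 to 1 range. The two rows are almost identical in gold and kills but differ in one dummy-encoded factor.

```r
# Two hypothetical matches: nearly equal numeric features, different bottle-time category
a <- c(gold = 0.50, kills = 0.20, bottle2 = 1, bottle3 = 0)
b <- c(gold = 0.52, kills = 0.22, bottle2 = 0, bottle3 = 1)

euclid <- function(u, v) sqrt(sum((u - v)^2))

d_num <- euclid(a[c("gold", "kills")], b[c("gold", "kills")])  # numeric features only
d_all <- euclid(a, b)                                          # numeric + dummy columns

c(numeric_only = d_num, with_dummies = d_all)
```

A single category mismatch flips two dummy columns and adds 2 to the squared distance, which swamps the small differences in the numeric features.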

Model Evaluation & Conclusion

Bristleback image by: chroneco

Confusion Matrix Combined

# combine all confusion matrix into one dataframe to make it easier to evaluate
eval.glm <- data.frame(t(as.matrix(glm.ev1, what = "classes")))
eval.glm <- cbind(eval.glm, data.frame(t(as.matrix(glm.ev1, what = "overall"))))

eval.knn <- data.frame(t(as.matrix(knn.ev1, what = "classes")))
eval.knn <- cbind(eval.knn, data.frame(t(as.matrix(knn.ev1, what = "overall"))))

eval.glm.2 <- data.frame(t(as.matrix(glm.ev2, what = "classes")))
eval.glm.2 <- cbind(eval.glm.2, data.frame(t(as.matrix(glm.ev2, what = "overall"))))

eval.knn.2 <- data.frame(t(as.matrix(knn.ev2, what = "classes")))
eval.knn.2 <- cbind(eval.knn.2, data.frame(t(as.matrix(knn.ev2, what = "overall"))))

mod.eval <- rbind(eval.glm,eval.knn,eval.glm.2,eval.knn.2)
mod.eval <- mod.eval %>% `row.names<-`(c("glm","knn","glm2","knn2"))


head(mod.eval[,c("Sensitivity","Specificity","Precision","Recall","F1","Accuracy")])

Conclusion

From the table above, we can see that glm is far better than knn, even after I fed in all the variables and changed K for KNN. This is because KNN does not deal well with categorical variables, and roughly half of the columns we fed to it are dummy-encoded categorical variables. The glm without stepwise has the best metrics of all the models. Our glm improvement with backward stepwise doesn’t actually improve any evaluation metric, but we can say the model has better quality because of its lower AIC (1817.1 versus 1871.8). It is always possible to reach higher accuracy (or other metrics) by trying other classification models. We’ll do that in the future.
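
As a quick sanity check on the two AIC values, here is a small worked calculation from the numbers printed in the model summaries. It relies on the fact that, for a binomial glm fitted to individual 0/1 outcomes, AIC equals the residual deviance plus twice the number of estimated coefficients; the variable names are mine.

```r
# Numbers copied from the two summary() outputs
n_obs <- 3201 + 1            # null deviance degrees of freedom + 1

k_full <- n_obs - 3127       # coefficients in the full glm
k_step <- n_obs - 3174       # coefficients in the stepwise glm

aic_full <- 1721.8 + 2 * k_full
aic_step <- 1761.1 + 2 * k_step

c(k_full = k_full, k_step = k_step, aic_full = aic_full, aic_step = aic_step)
```

The stepwise model fits slightly worse (higher deviance) but uses far fewer coefficients, and that trade-off is what gives it the lower AIC.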

The most common use case for predicting a Dota 2 winner is gambling/betting, right? Leaving aside the harm of gambling addiction, that makes the precision value important. We need to limit the number of false positives so that our investment (aka the money we bet) doesn’t get wasted. So if we prioritize the model with the higher precision, in this case we’ll use Logistic Regression without backward stepwise (even though it is only slightly better).
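
For reference, here is a short sketch that recomputes precision (Pos Pred Value) for the four models from the counts in the confusion matrices printed earlier. Precision is true positives divided by all positive predictions; the vector names are mine.

```r
# True positives and false positives copied from the four confusion matrices
tp <- c(glm = 368, knn = 297, glm2 = 364, knn2 = 308)
fp <- c(glm = 37,  knn = 224, glm2 = 40,  knn2 = 239)

# Precision: of all predicted Radiant wins, how many were real wins
precision <- tp / (tp + fp)
round(precision, 4)

best_model <- names(which.max(precision))
best_model
```

The gap between the two glm models is under one percentage point, so the choice between them on precision alone is a close call.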

Thank you!

Shadow Fiend image by: chroneco

DotaScience

jojoecp

4/4/2020