df <- read.csv("dota2.csv")
The data set I worked on is about an online MOBA game called Dota 2 (Defense of the Ancients 2). It started as a mod for Warcraft III: The Frozen Throne, one of Blizzard Entertainment’s most popular games, and became more popular than the game itself. In the early 2010s Valve acquired the rights to the mod and turned it into the full-fledged game now known as Dota 2.
For this project I pulled the data set from Kaggle (link: https://www.kaggle.com/devinanzelmo/dota-2-matches ). The data set is massive, so I took only five variables, covering both categorical and quantitative data. My categorical variables are radiant_win (from match.csv), ability_name (from ability_ids.csv) and item_id (from purchase_id). My quantitative variables are duration and first_blood_time (both from match.csv). These are the variables that I will be using for my analysis.
I first decided which data I wanted for the project. I was not able to extract only the required columns directly, so I downloaded the whole data set from https://www.kaggle.com/devinanzelmo/dota-2-matches. I then made a new .csv file containing all of my categorical and quantitative variables, namely radiant_win, ability_name, item_id, duration and first_blood_time, and uploaded it to RStudio Cloud.
To set the labels, colour and orientation of each graph, I used ‘main’ for the title at the top, ‘xlab’ for the label at the bottom, ‘ylab’ for the label on the left, ‘col’ for the colour of the graph and ‘horizontal’ to set the orientation of the graph.
The analysis below is organized into parts numbered 1 to 6, starting with the kind of data I am dealing with and followed by graphs, plots, summaries, text and tables for the different variables.
For my project I had to select one of the quantitative variables and find its mean, standard deviation and five-number summary. Then I had to show three graphical displays: a histogram, a box plot and a Q-Q plot. I also had to check for outliers using the IQR and describe what kind of distribution the variable has.
This chunk of code returns the mean, the standard deviation and the five-number summary (which summary() prints together with the mean), respectively, of the variable duration.
mean(df$duration,trim = 0, na.rm = FALSE)
## [1] 2569.424
sd(df$duration)
## [1] 668.4912
summary(df$duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1443 2116 2461 2569 2946 5344
There are three graphical displays: a histogram, a box plot and a Q-Q plot. I also check whether there are any outliers and what kind of distribution this variable has. For me this section was the hardest, because I had to find out how to do each of these and it took me a while to figure things out, but it was fun.
Showing the histogram was pretty easy, but I had to figure out how to draw a curve on the graph to show the distribution of the variable. First I used the hist() function with probability = TRUE so that the y-axis shows density, then I used lines(density()) on duration to draw the density curve over the histogram.
hist(df$duration,
main = "Dota2 game Length (Positive Skewed Distribution)",
xlab = "Game duration (sec)",
col = "pink",
probability = TRUE)
lines(density(df$duration),col = "blue")
This was probably the easiest plot for me to draw, as I only had to call the qqnorm() function and then the qqline() function to add the reference line to the plot.
qqnorm(df$duration,
main = "QQPlot (Positive Skewed Distribution)",
ylab = "Duration (sec)")
qqline(df$duration)
This was also an easy plot, as I only had to call the boxplot() function and set horizontal = TRUE to orient it the way I wanted.
boxplot(df$duration,
main ="BoxPlot (Positive Skewed Distribution) ",
horizontal = TRUE,
xlab = "Duration (sec)")
For the outliers I used the IQR() function to get the interquartile range of the variable. A value counts as an outlier when it lies more than 1.5 × IQR below the first quartile or above the third quartile. There are six observations outside this range. We can see some points beyond the whiskers in the box plot, but we cannot pinpoint them from the graph, so I used boxplot.stats() to extract the outlying values and which() to find their rows, and then displayed those rows as a table.
IQR(df$duration, na.rm = FALSE)
## [1] 829.5
out <- boxplot.stats(df$duration)$out
out_ind <- which(df$duration %in% c(out))
out_ind
## [1] 45 81 139 145 184 242
df[out_ind,]
## ability_name radiant_win item_id duration first_blood_time
## 45 mirana_arrow TRUE 29 4348 210
## 81 razor_unstable_current TRUE 59 4208 135
## 139 riki_smoke_screen FALSE 46 4295 177
## 145 enigma_midnight_pulse FALSE 60 5344 14
## 184 pugna_decrepify TRUE 46 5267 84
## 242 life_stealer_rage TRUE 46 4494 89
These are the outliers of the variable duration.
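The rule that boxplot.stats() applies can also be written out by hand. The sketch below assumes the usual 1.5 × IQR fences and uses a small made-up vector (`x` is hypothetical, not the Dota 2 data), so the numbers are for illustration only.

```r
# Hypothetical durations in seconds (not taken from dota2.csv)
x <- c(1500, 1800, 2100, 2300, 2500, 2700, 2900, 3100, 5200)

q   <- quantile(x, c(0.25, 0.75))   # first and third quartiles
iqr <- IQR(x)                       # same as q[2] - q[1]

lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

Applied to duration, the reported quartiles (about 2116 and 2946) and the IQR of 829.5 put the upper fence at roughly 4190 seconds, and all six rows listed above have a longer duration than that.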
I used two variables, duration and first_blood_time, to plot their relationship and look at their correlation. I used the plot() function, which takes two numeric variables x and y.
plot(df$duration,df$first_blood_time,
main = "Multiple variable Graphical Display",
xlab = "Duration (sec)",
ylab ="First blood Time (sec)",
col = "blue",
pch = 19)
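Correlation: In the scatter plot most points sit between 2000 and 3000 seconds on the duration axis, which means most matches in the sample last that long. It does not mean that first blood happens in the mid game: first blood time is on the vertical axis, and in the outlier rows listed earlier it ranges from 14 to 210 seconds. The plot alone does not give the strength of the relationship between the two variables; the cor() function would give the correlation coefficient.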
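A scatter plot shows the pattern, while the strength of a linear relationship is usually summarised with the Pearson correlation coefficient. The sketch below uses simulated vectors (the values and the seed are hypothetical stand-ins, not the project data), so it only illustrates the calls.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated stand-ins for two numeric columns (hypothetical values)
duration         <- rnorm(100, mean = 2500, sd = 600)
first_blood_time <- rnorm(100, mean = 90, sd = 50)

# Pearson correlation coefficient, always between -1 and 1
r <- cor(duration, first_blood_time)
r

# cor.test() adds a p-value for the null hypothesis of zero correlation
p_value <- cor.test(duration, first_blood_time)$p.value
p_value
```

On the project data the equivalent call would be cor(df$duration, df$first_blood_time).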
I have two tables: frequency and relative frequency. The frequency table shows the count of each category of a categorical variable, in this case radiant_win, which only contains TRUE and FALSE. The relative frequency table shows each count as a proportion of the total, which makes it easier to compare categories.
table(df$radiant_win)
##
## FALSE TRUE
## 118 132
table(df$radiant_win)/length(df$radiant_win)
##
## FALSE TRUE
## 0.472 0.528
cat("\n")
Data Analysis from Relative Frequency Table
radiant_win is FALSE in 47.2% of the matches in the data set and TRUE in 52.8% of them.
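Dividing a table by its length works, and base R also has prop.table() for the same job. The sketch below rebuilds the counts reported above (118 FALSE, 132 TRUE) in a stand-alone vector rather than reading dota2.csv.

```r
# Rebuild the counts reported above: 118 FALSE and 132 TRUE
radiant_win <- rep(c(FALSE, TRUE), times = c(118, 132))

counts <- table(radiant_win)
props  <- prop.table(counts)   # same as counts / sum(counts)

round(100 * props, 1)          # proportions expressed as percentages
```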
For this part I passed two of my categorical variables to the table() function, assigned the result to two_way_table and then displayed the table.
two_way_table <- table(df$ability_name,df$radiant_win)
two_way_table
##
## FALSE TRUE
## ability_base 0 1
## antimage_blink 0 1
## antimage_mana_break 1 0
## antimage_mana_void 0 1
## antimage_spell_shield 0 1
## attribute_bonus 1 0
## axe_battle_hunger 1 0
## axe_berserkers_call 0 1
## axe_counter_helix 1 0
## axe_culling_blade 0 1
## bane_brain_sap 1 0
## bane_enfeeble 1 0
## bane_fiends_grip 0 1
## bane_nightmare 0 1
## beastmaster_boar_poison 1 0
## beastmaster_call_of_the_wild 1 0
## beastmaster_hawk_invisibility 1 0
## beastmaster_inner_beast 1 0
## beastmaster_primal_roar 0 1
## beastmaster_wild_axes 1 0
## bloodseeker_blood_bath 1 0
## bloodseeker_bloodrage 1 0
## bloodseeker_rupture 1 0
## bloodseeker_thirst 0 1
## courier_burst 0 1
## courier_return_stash_items 0 1
## courier_return_to_base 1 0
## courier_shield 0 1
## courier_take_stash_items 0 1
## courier_transfer_items 0 1
## crystal_maiden_brilliance_aura 0 1
## crystal_maiden_crystal_nova 1 0
## crystal_maiden_freezing_field 1 0
## crystal_maiden_frostbite 1 0
## dark_seer_ion_shell 0 1
## dark_seer_surge 0 1
## dark_seer_vacuum 0 1
## dazzle_poison_touch 1 0
## dazzle_shadow_wave 0 1
## dazzle_shallow_grave 0 1
## dazzle_weave 0 1
## death_prophet_carrion_swarm 0 1
## death_prophet_exorcism 1 0
## death_prophet_silence 0 1
## death_prophet_witchcraft 0 1
## default_attack 1 0
## dragon_knight_breathe_fire 1 0
## dragon_knight_dragon_blood 0 1
## dragon_knight_dragon_tail 1 0
## dragon_knight_elder_dragon_form 1 0
## dragon_knight_frost_breath 0 1
## drow_ranger_frost_arrows 0 1
## drow_ranger_marksmanship 0 1
## drow_ranger_silence 0 1
## drow_ranger_trueshot 0 1
## earthshaker_aftershock 0 1
## earthshaker_echo_slam 0 1
## earthshaker_enchant_totem 1 0
## earthshaker_fissure 0 1
## enigma_black_hole 1 0
## enigma_demonic_conversion 1 0
## enigma_malefice 1 0
## enigma_midnight_pulse 1 0
## faceless_void_backtrack 0 1
## faceless_void_chronosphere 0 1
## faceless_void_time_lock 1 0
## faceless_void_time_walk 1 0
## furion_force_of_nature 1 0
## furion_sprout 1 0
## furion_teleportation 1 0
## furion_wrath_of_nature 0 1
## Invoker_sun_strike 1 0
## juggernaut_blade_dance 1 0
## juggernaut_blade_fury 0 1
## juggernaut_healing_ward 1 0
## juggernaut_omni_slash 0 1
## kunkka_ghostship 1 0
## kunkka_return 0 1
## kunkka_tidebringer 0 1
## kunkka_torrent 1 0
## kunkka_x_marks_the_spot 1 0
## leshrac_diabolic_edict 0 1
## leshrac_lightning_storm 1 0
## leshrac_pulse_nova 0 1
## leshrac_split_earth 0 1
## lich_chain_frost 1 0
## lich_dark_ritual 1 0
## lich_frost_armor 0 1
## lich_frost_nova 0 1
## life_stealer_consume 1 0
## life_stealer_feast 0 1
## life_stealer_infest 0 1
## life_stealer_open_wounds 0 1
## life_stealer_rage 0 1
## lina_dragon_slave 0 1
## lina_fiery_soul 1 0
## lina_laguna_blade 1 0
## lina_light_strike_array 0 1
## lion_finger_of_death 1 0
## lion_impale 1 0
## lion_mana_drain 0 1
## lion_voodoo 0 1
## luna_eclipse 0 1
## luna_lucent_beam 1 0
## luna_lunar_blessing 0 1
## luna_moon_glaive 0 1
## mirana_arrow 0 1
## mirana_invis 0 1
## mirana_leap 1 0
## mirana_starfall 0 1
## morphling_adaptive_strike 0 1
## morphling_morph 1 0
## morphling_morph_agi 1 0
## morphling_morph_replicate 0 1
## morphling_morph_str 1 0
## morphling_replicate 1 0
## morphling_waveform 0 1
## necrolyte_death_pulse 0 1
## necrolyte_heartstopper_aura 0 1
## necrolyte_reapers_scythe 1 0
## necrolyte_sadist 1 0
## necronomicon_archer_aoe 1 0
## necronomicon_archer_mana_burn 0 1
## necronomicon_warrior_last_will 0 1
## necronomicon_warrior_mana_burn 0 1
## necronomicon_warrior_sight 1 0
## nevermore_dark_lord 1 0
## nevermore_necromastery 1 0
## nevermore_requiem 1 0
## nevermore_shadowraze1 0 1
## nevermore_shadowraze2 0 1
## nevermore_shadowraze3 1 0
## phantom_assassin_blur 0 1
## phantom_assassin_coup_de_grace 1 0
## phantom_assassin_phantom_strike 1 0
## phantom_assassin_stifling_dagger 1 0
## phantom_lancer_doppelwalk 0 1
## phantom_lancer_juxtapose 1 0
## phantom_lancer_phantom_edge 0 1
## phantom_lancer_spirit_lance 0 1
## puck_dream_coil 0 1
## puck_ethereal_jaunt 0 1
## puck_illusory_orb 0 1
## puck_phase_shift 0 1
## puck_waning_rift 0 1
## pudge_dismember 1 0
## pudge_flesh_heap 0 1
## pudge_meat_hook 1 0
## pudge_rot 0 1
## pugna_decrepify 0 1
## pugna_life_drain 1 0
## pugna_nether_blast 0 1
## pugna_nether_ward 1 0
## queenofpain_blink 0 1
## queenofpain_scream_of_pain 0 1
## queenofpain_shadow_strike 0 1
## queenofpain_sonic_wave 0 1
## rattletrap_battery_assault 0 1
## rattletrap_hookshot 1 0
## rattletrap_power_cogs 0 1
## rattletrap_rocket_flare 0 1
## razor_eye_of_the_storm 0 1
## razor_plasma_field 1 0
## razor_static_link 0 1
## razor_unstable_current 0 1
## riki_blink_strike 0 1
## riki_permanent_invisibility 0 1
## riki_smoke_screen 1 0
## riki_tricks_of_the_trade 1 0
## roshan_bash 0 1
## roshan_devotion 1 0
## roshan_inherent_buffs 1 0
## roshan_slam 1 0
## roshan_spell_block 0 1
## sandking_burrowstrike 1 0
## sandking_caustic_finale 1 0
## sandking_epicenter 1 0
## sandking_sand_storm 0 1
## shadow_shaman_ether_shock 0 1
## shadow_shaman_mass_serpent_ward 0 1
## shadow_shaman_shackles 0 1
## shadow_shaman_voodoo 1 0
## skeleton_king_hellfire_blast 1 0
## skeleton_king_mortal_strike 0 1
## skeleton_king_reincarnation 0 1
## skeleton_king_vampiric_aura 1 0
## slardar_amplify_damage 1 0
## slardar_bash 0 1
## slardar_slithereen_crush 1 0
## slardar_sprint 1 0
## sniper_assassinate 0 1
## sniper_headshot 1 0
## sniper_shrapnel 0 1
## sniper_take_aim 1 0
## storm_spirit_ball_lightning 0 1
## storm_spirit_electric_vortex 1 0
## storm_spirit_overload 0 1
## storm_spirit_static_remnant 1 0
## sven_gods_strength 1 0
## sven_great_cleave 1 0
## sven_storm_bolt 1 0
## sven_warcry 1 0
## templar_assassin_meld 0 1
## templar_assassin_psi_blades 1 0
## templar_assassin_psionic_trap 1 0
## templar_assassin_refraction 1 0
## templar_assassin_self_trap 1 0
## templar_assassin_trap 1 0
## tidehunter_anchor_smash 1 0
## tidehunter_gush 0 1
## tidehunter_kraken_shell 0 1
## tidehunter_ravage 0 1
## tinker_heat_seeking_missile 1 0
## tinker_laser 0 1
## tinker_march_of_the_machines 0 1
## tinker_rearm 0 1
## tiny_avalanche 0 1
## tiny_craggy_exterior 1 0
## tiny_grow 0 1
## tiny_toss 0 1
## vengefulspirit_command_aura 0 1
## vengefulspirit_magic_missile 1 0
## vengefulspirit_nether_swap 1 0
## vengefulspirit_wave_of_terror 1 0
## venomancer_plague_ward 1 0
## venomancer_poison_nova 0 1
## venomancer_poison_sting 1 0
## venomancer_venomous_gale 0 1
## viper_corrosive_skin 0 1
## viper_nethertoxin 0 1
## viper_poison_attack 1 0
## viper_viper_strike 0 1
## warlock_fatal_bonds 1 0
## warlock_golem_flaming_fists 0 1
## warlock_golem_permanent_immolation 1 0
## warlock_rain_of_chaos 1 0
## warlock_shadow_word 0 1
## warlock_upheaval 0 1
## windrunner_focusfire 1 0
## windrunner_powershot 1 0
## windrunner_shackleshot 1 0
## windrunner_windrun 1 0
## witch_doctor_death_ward 0 1
## witch_doctor_maledict 0 1
## witch_doctor_paralyzing_cask 0 1
## witch_doctor_voodoo_restoration 1 0
## zuus_arc_lightning 0 1
## zuus_lightning_bolt 1 0
## zuus_static_field 1 0
## zuus_thundergods_wrath 0 1
Every ability appears exactly once in the sample, so each row of the table contains a single 1 and a single 0.
Relationship between two Variables
The game revolves around the heroes, their abilities, items, the play style and skill of the players, and many other independent variables. In this table, however, the relationship between ability_name and radiant_win cannot really be measured: with only one match per ability, each ability is tied to a single outcome by construction. In actual games a well-used ability can wipe out the opposing team and help make radiant_win TRUE, but more matches per ability would be needed to show that.
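Whether a categorical column has one observation per level can be checked directly. The sketch below uses a tiny made-up data frame (`toy` and its ability names are hypothetical) to show the check.

```r
# Toy data: every ability name occurs exactly once (hypothetical values)
toy <- data.frame(
  ability_name = c("a_blink", "b_hook", "c_nova", "d_ward"),
  radiant_win  = c(TRUE, FALSE, FALSE, TRUE)
)

# anyDuplicated() returns 0 when no value is repeated
no_repeats <- anyDuplicated(toy$ability_name) == 0
no_repeats

# Each row of the two-way table then sums to 1
row_totals <- rowSums(table(toy$ability_name, toy$radiant_win))
all(row_totals == 1)
```

The same two checks on df$ability_name would confirm whether the pattern seen in the two-way table holds for all 250 rows.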
For this part I had to show plots of one categorical and one quantitative variable side by side and use their summary statistics to compare groups. To plot two graphs side by side I used par(mfrow=c(1,2)), which lays out the plots in 1 row and 2 columns. Then I used the plot() function for each variable and the summary() function to find each five-number summary.
par(mfrow=c(1,2))
plot(df$duration,
main="Scatterplot of Duration (sec)",
xlab = "",
ylab = "Duration")
plot(df$item_id,
main="Scatterplot of Item Id",
xlab = "",
ylab = "Item ID")
cat("Duration Summary:\n")
## Duration Summary:
summary(df$duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1443 2116 2461 2569 2946 5344
cat("\nItem Id Summary:\n")
##
## Item Id Summary:
summary(df$item_id)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 25.00 46.00 56.22 55.50 218.00
We can see that the two summaries are very different. Duration is the time the game took to finish and has larger values because it is measured in seconds, while item id is just the identifier assigned to each item. The two summaries are computed independently, so the minimum duration of 1443 sec and the minimum item id of 2 do not come from the same match, and the same holds for the quartiles. Because item id is a label rather than a measurement, its mean and quartiles have no real meaning; a frequency table is a better summary for it.
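One way to compare groups with summary statistics is to split the quantitative variable by the levels of a categorical one. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the project data) and tapply().

```r
# Small made-up sample (hypothetical values, not the project data)
toy <- data.frame(
  radiant_win = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  duration    = c(2000, 2400, 2800, 2200, 2600, 3600)
)

# Mean duration within each outcome group
group_means <- tapply(toy$duration, toy$radiant_win, mean)
group_means

# Full summary (min, quartiles, mean, max) for each group
tapply(toy$duration, toy$radiant_win, summary)
```

On the project data, tapply(df$duration, df$radiant_win, summary) would give one summary per outcome, and boxplot(duration ~ radiant_win, data = df) would draw the two groups side by side.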
Visualization of Data
I had to find a variable that had not yet been used for a graphical display, and item_id was a good choice for showing different graphical representations. I have three visualizations of the data: a bar plot, a scatter plot and a heatmap. For the bar plot and scatter plot I just had to call barplot() and scatter.smooth(), which was pretty easy.
For the heatmap I loaded ggplot2 with require(ggplot2), although heatmap() itself comes with base R. I showed all of my variables except the first column in the heatmap. First, I read my ‘dota2.csv’ file into data, then converted it to a matrix with data.matrix(), leaving out the first column, and assigned the result back to data. Then I used the heatmap() function and adjusted its arguments.
barplot(df$item_id,
main = "Bar Chart",
xlab = "Item Id")
scatter.smooth(df$item_id,
main = "Scatter Plot")
require(ggplot2)
data <- read.csv("dota2.csv", header = TRUE)
data <- data.matrix(data[,-1])
heatmap(t(data),
main = "Heat Map",
Rowv = NA,
Colv = NA,
col = heat.colors(200,alpha = 1,rev = FALSE),
scale = "row")
summary(df$item_id)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 25.00 46.00 56.22 55.50 218.00
In conclusion, the most interesting features of my project were the graphs, tables and calculations. I can say that graphs are the main reason to use RStudio, because they provide so much information and we can manipulate the data according to our needs. It is like making a whole new thing from a single function that can be called anywhere in the application. Making tables and doing calculations is also very interesting and easy to perform, as it follows the same pattern of calling a function and passing it data. For me this project was pretty challenging, because I didn’t know how to use RStudio Cloud and had to learn each step from the course materials, Stack Overflow, GitHub issue forums and other sources. In the end it was all worth it, because I learned something new that is very interesting and good for my personal growth.
library(dplyr) # provides %>% and mutate()
df <- df %>% mutate(
radiant_win = factor(radiant_win == TRUE,
levels = c(TRUE,FALSE),
labels = c('Win','Lose'))
)
Here I convert my variable radiant_win into a factor with the labels Win and Lose.
library(rpart)
library(rpart.plot)
tree <- rpart(radiant_win~., data = df)
tree
## n= 250
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 250 118 Win (0.5280000 0.4720000)
## 2) ability_name=ability_base,antimage_blink,antimage_mana_void,antimage_spell_shield,axe_berserkers_call,axe_culling_blade,bane_fiends_grip,bane_nightmare,beastmaster_primal_roar,bloodseeker_thirst,courier_burst,courier_return_stash_items,courier_shield,courier_take_stash_items,courier_transfer_items,crystal_maiden_brilliance_aura,dark_seer_ion_shell,dark_seer_surge,dark_seer_vacuum,dazzle_shadow_wave,dazzle_shallow_grave,dazzle_weave,death_prophet_carrion_swarm,death_prophet_silence,death_prophet_witchcraft,dragon_knight_dragon_blood,dragon_knight_frost_breath,drow_ranger_frost_arrows,drow_ranger_marksmanship,drow_ranger_silence,drow_ranger_trueshot,earthshaker_aftershock,earthshaker_echo_slam,earthshaker_fissure,faceless_void_backtrack,faceless_void_chronosphere,furion_wrath_of_nature,juggernaut_blade_fury,juggernaut_omni_slash,kunkka_return,kunkka_tidebringer,leshrac_diabolic_edict,leshrac_pulse_nova,leshrac_split_earth,lich_frost_armor,lich_frost_nova,life_stealer_feast,life_stealer_infest,life_stealer_open_wounds,life_stealer_rage,lina_dragon_slave,lina_light_strike_array,lion_mana_drain,lion_voodoo,luna_eclipse,luna_lunar_blessing,luna_moon_glaive,mirana_arrow,mirana_invis,mirana_starfall,morphling_adaptive_strike,morphling_morph_replicate,morphling_waveform,necrolyte_death_pulse,necrolyte_heartstopper_aura,necronomicon_archer_mana_burn,necronomicon_warrior_last_will,necronomicon_warrior_mana_burn,nevermore_shadowraze1,nevermore_shadowraze2,phantom_assassin_blur,phantom_lancer_doppelwalk,phantom_lancer_phantom_edge,phantom_lancer_spirit_lance,puck_dream_coil,puck_ethereal_jaunt,puck_illusory_orb,puck_phase_shift,puck_waning_rift,pudge_flesh_heap,pudge_rot,pugna_decrepify,pugna_nether_blast,queenofpain_blink,queenofpain_scream_of_pain,queenofpain_shadow_strike,queenofpain_sonic_wave,rattletrap_battery_assault,rattletrap_power_cogs,rattletrap_rocket_flare,razor_eye_of_the_storm,razor_static_link,razor_unstable_current,riki_blink_strike,riki_permanent_invisibility,roshan_bash,roshan_spell_block,sandking_sand_storm,shadow_shaman_ether_shock,shadow_shaman_mass_serpent_ward,shadow_shaman_shackles,skeleton_king_mortal_strike,skeleton_king_reincarnation,slardar_bash,sniper_assassinate,sniper_shrapnel,storm_spirit_ball_lightning,storm_spirit_overload,templar_assassin_meld,tidehunter_gush,tidehunter_kraken_shell,tidehunter_ravage,tinker_laser,tinker_march_of_the_machines,tinker_rearm,tiny_avalanche,tiny_grow,tiny_toss,vengefulspirit_command_aura,venomancer_poison_nova,venomancer_venomous_gale,viper_corrosive_skin,viper_nethertoxin,viper_viper_strike,warlock_golem_flaming_fists,warlock_shadow_word,warlock_upheaval,witch_doctor_death_ward,witch_doctor_maledict,witch_doctor_paralyzing_cask,zuus_arc_lightning,zuus_thundergods_wrath 132 0 Win (1.0000000 0.0000000) *
## 3) ability_name=antimage_mana_break,attribute_bonus,axe_battle_hunger,axe_counter_helix,bane_brain_sap,bane_enfeeble,beastmaster_boar_poison,beastmaster_call_of_the_wild,beastmaster_hawk_invisibility,beastmaster_inner_beast,beastmaster_wild_axes,bloodseeker_blood_bath,bloodseeker_bloodrage,bloodseeker_rupture,courier_return_to_base,crystal_maiden_crystal_nova,crystal_maiden_freezing_field,crystal_maiden_frostbite,dazzle_poison_touch,death_prophet_exorcism,default_attack,dragon_knight_breathe_fire,dragon_knight_dragon_tail,dragon_knight_elder_dragon_form,earthshaker_enchant_totem,enigma_black_hole,enigma_demonic_conversion,enigma_malefice,enigma_midnight_pulse,faceless_void_time_lock,faceless_void_time_walk,furion_force_of_nature,furion_sprout,furion_teleportation,Invoker_sun_strike,juggernaut_blade_dance,juggernaut_healing_ward,kunkka_ghostship,kunkka_torrent,kunkka_x_marks_the_spot,leshrac_lightning_storm,lich_chain_frost,lich_dark_ritual,life_stealer_consume,lina_fiery_soul,lina_laguna_blade,lion_finger_of_death,lion_impale,luna_lucent_beam,mirana_leap,morphling_morph,morphling_morph_agi,morphling_morph_str,morphling_replicate,necrolyte_reapers_scythe,necrolyte_sadist,necronomicon_archer_aoe,necronomicon_warrior_sight,nevermore_dark_lord,nevermore_necromastery,nevermore_requiem,nevermore_shadowraze3,phantom_assassin_coup_de_grace,phantom_assassin_phantom_strike,phantom_assassin_stifling_dagger,phantom_lancer_juxtapose,pudge_dismember,pudge_meat_hook,pugna_life_drain,pugna_nether_ward,rattletrap_hookshot,razor_plasma_field,riki_smoke_screen,riki_tricks_of_the_trade,roshan_devotion,roshan_inherent_buffs,roshan_slam,sandking_burrowstrike,sandking_caustic_finale,sandking_epicenter,shadow_shaman_voodoo,skeleton_king_hellfire_blast,skeleton_king_vampiric_aura,slardar_amplify_damage,slardar_slithereen_crush,slardar_sprint,sniper_headshot,sniper_take_aim,storm_spirit_electric_vortex,storm_spirit_static_remnant,sven_gods_strength,sven_great_cleave,sven_storm_bolt,sven_warcry,templar_assassin_psi_blades,templar_assassin_psionic_trap,templar_assassin_refraction,templar_assassin_self_trap,templar_assassin_trap,tidehunter_anchor_smash,tinker_heat_seeking_missile,tiny_craggy_exterior,vengefulspirit_magic_missile,vengefulspirit_nether_swap,vengefulspirit_wave_of_terror,venomancer_plague_ward,venomancer_poison_sting,viper_poison_attack,warlock_fatal_bonds,warlock_golem_permanent_immolation,warlock_rain_of_chaos,windrunner_focusfire,windrunner_powershot,windrunner_shackleshot,windrunner_windrun,witch_doctor_voodoo_restoration,zuus_lightning_bolt,zuus_static_field 118 0 Lose (0.0000000 1.0000000) *
rpart.plot(tree,
main = "Match outcome\n",
shadow.col = "gray",
under = TRUE,
extra = 2,
tweak = 1.2)
I imported two libraries, ‘rpart’ and ‘rpart.plot’, to fit a decision tree and visualize it.
Then I used the decision tree to predict the outcome for each row of the dataset that we are going to be using for the rest of the project.
pred <- predict(tree,df,type = "class")
head(pred)
## 1 2 3 4 5 6
## Win Lose Lose Lose Win Win
## Levels: Win Lose
Each match has been classified into its category.
predict(tree, df) %>%
head()
## Win Lose
## 1 1 0
## 2 0 1
## 3 0 1
## 4 0 1
## 5 1 0
## 6 1 0
We use the following code to create the confusion matrix.
confusion_table <- with(df,table(radiant_win, pred))
confusion_table
## pred
## radiant_win Win Lose
## Win 132 0
## Lose 0 118
A confusion matrix is used to evaluate the accuracy and precision of a model's predictions. In my model the predictions match the actual win and lose values exactly: 132 true positives (Win predicted as Win), 118 true negatives (Lose predicted as Lose), 0 false positives and 0 false negatives. Note that these predictions were made on the same data the tree was trained on, and the tree splits only on ability_name, which takes a different value in every row, so a perfect score here does not show how well the model would do on new matches.
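The usual summary measures follow directly from the four cells of the matrix. The sketch below re-enters the counts reported above by hand and assumes that Win is treated as the positive class.

```r
# Confusion matrix with the counts reported above
# (rows = actual outcome, columns = predicted outcome)
cm <- matrix(c(132,   0,
                 0, 118),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("Win", "Lose"),
                             predicted = c("Win", "Lose")))

tp <- cm["Win",  "Win"]    # true positives
fn <- cm["Win",  "Lose"]   # false negatives
fp <- cm["Lose", "Win"]    # false positives
tn <- cm["Lose", "Lose"]   # true negatives

accuracy  <- (tp + tn) / sum(cm)   # share of all rows classified correctly
precision <- tp / (tp + fp)        # share of predicted wins that are real wins
recall    <- tp / (tp + fn)        # share of real wins that were found
c(accuracy = accuracy, precision = precision, recall = recall)
```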
I will now check whether the result holds up by splitting the data into a training set (about two-thirds of the rows) and a test set (about one-third).
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
training <- createDataPartition(y = df$first_blood_time, p = .66, list = FALSE)
df_train <- df %>% slice(training)
df_test <- df %>% slice(-training)
dim(df_train)
## [1] 167 5
dim(df_test)
## [1] 83 5
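To check the model I used the createDataPartition() function to split the data into a training set with about two-thirds of the rows (167) and a test set with the remaining third (83). This is a single holdout split rather than full cross-validation. Note that the split is stratified on first_blood_time, the y argument, rather than on the outcome radiant_win. The split sizes alone cannot show whether a model is overfit; that is judged by comparing its accuracy on the training set with its accuracy on the test set.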
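A holdout split can also be written with base R, and setting a seed makes it reproducible. The sketch below swaps createDataPartition() for a plain random sample() with no stratification, and the seed value is an arbitrary assumption.

```r
set.seed(42)   # hypothetical seed so the split can be reproduced

n <- 250                                        # number of rows in the data set
train_idx <- sample(n, size = round(0.66 * n))  # row numbers for training
test_idx  <- setdiff(seq_len(n), train_idx)     # everything else is held out

length(train_idx)   # rows used for training
length(test_idx)    # rows held out for testing
```

The two index vectors could then be used as df[train_idx, ] and df[test_idx, ]. Because this split is not stratified, the group sizes differ slightly from the ones createDataPartition() produced.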
I use the training set to build my model and then test it on the test set. In my original tree, ability_name is a very important variable. I am going to remove it so that I can check whether the new prediction holds the same result as my original prediction.
tree_from_train <- rpart(radiant_win~.,
data = subset(df_train, select = c(-ability_name)))
pred_test <- predict(tree_from_train,subset(df_test, select = c(-ability_name)),type = "class")
with(df_test, table(radiant_win, pred_test))
## pred_test
## radiant_win Win Lose
## Win 20 18
## Lose 31 14
As we can see, the predictions are far less accurate once the important variable is taken out: only 34 of the 83 test matches (20 + 14, about 41%) are classified correctly.
Now I will create full tree.
df_no_ability <- subset(df, select = c(-ability_name))
tree_full <- sample_n(df_no_ability,180) %>%
rpart(radiant_win~., data = ., control = rpart.control(minsplit = 2, cp =0))
rpart.plot(tree_full, extra = 2, roundint = FALSE,
box.palette = list("Gn", "Bu"))
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
Now, that’s a lot of data and looks difficult to interpret!
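The tree is grown on a random sample of 180 of the 250 rows, with minsplit = 2 and cp = 0 so that it keeps splitting until it fits those rows as closely as possible. The remaining 70 rows were not seen during training.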
pred_full <- predict(tree_full, df_no_ability, type = "class")
with(df, table(radiant_win, pred_full))
## pred_full
## radiant_win Win Lose
## Win 115 17
## Lose 14 104
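This looks good, with 219 of 250 matches (about 88%) classified correctly and 31 misclassified, but 180 of those rows were used to grow the tree. If the tree fits its 180 training rows perfectly, only 39 of the 70 unseen rows (about 56%) are classified correctly, which is close to guessing.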
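Accuracy measured on rows that were used to grow a tree tends to be optimistic, so it helps to score the training rows and the unseen rows separately. The sketch below uses simulated data in which the predictors carry no information about the outcome (`toy`, `x1`, `x2` and the seed are all hypothetical), with the same rpart.control() settings as the full tree.

```r
library(rpart)   # recommended package that ships with standard R installations

set.seed(7)      # hypothetical seed for reproducibility

# Simulated data: the predictors are pure noise (hypothetical values)
toy <- data.frame(
  y  = factor(sample(c("Win", "Lose"), 200, replace = TRUE)),
  x1 = rnorm(200),
  x2 = rnorm(200)
)

train_rows <- sample(nrow(toy), 140)
train <- toy[train_rows, ]
test  <- toy[-train_rows, ]

# Fully grown tree: minsplit = 2 and cp = 0 let it split as far as it can
fit <- rpart(y ~ ., data = train,
             control = rpart.control(minsplit = 2, cp = 0))

# Share of correct predictions on seen and unseen rows
train_acc <- mean(predict(fit, train, type = "class") == train$y)
test_acc  <- mean(predict(fit, test,  type = "class") == test$y)
c(train = train_acc, test = test_acc)
```

A large gap between the two numbers is the usual sign of overfitting. For the project the same comparison would need the row numbers drawn by sample_n() to be kept, so that the 70 held-out rows can be scored on their own.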
I will now use the chi-squared statistic to score the importance of each feature.
library(FSelector)
weights <- df %>% chi.squared(radiant_win~., data =.) %>%
as_tibble(rownames = "feature") %>%
arrange(desc(attr_importance))
weights
## # A tibble: 4 × 2
## feature attr_importance
## <chr> <dbl>
## 1 ability_name 1
## 2 item_id 0
## 3 duration 0
## 4 first_blood_time 0
ggplot(weights,
aes(x = attr_importance, y = reorder(feature,attr_importance))) +
geom_bar(stat = "identity") +
xlab("Importance score") + ylab("Feature")
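I used the chi-squared statistic to score the importance of each feature. It gave ability_name a score of 1 and every other feature a score of 0. Since ability_name takes a different value in every row, it separates the outcomes perfectly in this sample, so the top score says little about how useful it would be on new data.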
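To my understanding of the FSelector documentation, the score that chi.squared() reports is Cramér's V, which rescales the chi-squared statistic to lie between 0 and 1. The sketch below computes it by hand with base R on made-up data (`outcome`, `id`, `coin` and the helper `cramers_v` are hypothetical) to show why a column that is unique per row scores exactly 1.

```r
# Toy data: `id` is unique per row, `coin` is unrelated to the outcome
# (all values are hypothetical)
outcome <- rep(c("Win", "Lose"), times = c(6, 4))
id      <- paste0("ability_", seq_along(outcome))
coin    <- rep(c("heads", "tails"), times = 5)

cramers_v <- function(x, y) {
  tab  <- table(x, y)
  # Pearson chi-squared statistic; warnings about small counts are muted
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  # Rescale by the sample size and the smaller table dimension
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

v_id   <- cramers_v(id, outcome)     # unique identifier
v_coin <- cramers_v(coin, outcome)   # unrelated two-level variable
c(id = v_id, coin = v_coin)
```

A unique identifier always reaches the maximum, whatever the outcomes are, which is why the score for ability_name should be read with caution.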
I did not get a satisfying result from the chi-squared statistic, so I am using another method to measure variable importance in the model.
imp <- varImp(tree)
head(imp)
## Overall
## ability_name 124.608000
## duration 4.394060
## first_blood_time 3.422815
## item_id 2.361191
imp %>% ggplot(aes(x = row.names(imp),weight = Overall))+
geom_bar() + xlab("Feature") +ylab("Importance Score")
The varImp() function gives different scores from the chi-squared statistic, with small non-zero values for duration, first_blood_time and item_id, but the ranking is the same: ability_name has by far the highest importance. So both methods agree that ability_name carries more weight than the other variables in this model.