About the Data:

I got this Pokemon dataset from Kaggle. This data set includes 721 Pokemon, including their number, name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed.

Importing the Pokemon Data

library(Zelig)
## Loading required package: survival
library(ZeligChoice)
library(faraway)
## 
## Attaching package: 'faraway'
## The following objects are masked from 'package:survival':
## 
##     rats, solder
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(survival)
library(readr)

Pokemon=read_csv("Desktop/Pokemon.csv")
## Parsed with column specification:
## cols(
##   `#` = col_double(),
##   Name = col_character(),
##   `Type 1` = col_character(),
##   `Type 2` = col_character(),
##   Total = col_double(),
##   HP = col_double(),
##   Attack = col_double(),
##   Defense = col_double(),
##   `Sp. Atk` = col_double(),
##   `Sp. Def` = col_double(),
##   Speed = col_double(),
##   Generation = col_double(),
##   Legendary = col_logical()
## )
head(Pokemon)
## # A tibble: 6 x 13
##     `#` Name  `Type 1` `Type 2` Total    HP Attack Defense `Sp. Atk`
##   <dbl> <chr> <chr>    <chr>    <dbl> <dbl>  <dbl>   <dbl>     <dbl>
## 1     1 Bulb… Grass    Poison     318    45     49      49        65
## 2     2 Ivys… Grass    Poison     405    60     62      63        80
## 3     3 Venu… Grass    Poison     525    80     82      83       100
## 4     3 Venu… Grass    Poison     625    80    100     123       122
## 5     4 Char… Fire     <NA>       309    39     52      43        60
## 6     5 Char… Fire     <NA>       405    58     64      58        80
## # … with 4 more variables: `Sp. Def` <dbl>, Speed <dbl>, Generation <dbl>,
## #   Legendary <lgl>
#recoding Legendary to factor
Pokemon$Legendary=as.factor(Pokemon$Legendary)
library(texreg)
## Version:  1.36.23
## Date:     2017-03-03
## Author:   Philip Leifeld (University of Glasgow)
## 
## Please cite the JSS article in your publications -- see citation("texreg").
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:texreg':
## 
##     extract
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
## The following object is masked from 'package:texreg':
## 
##     extract

Picking our specified model based on AIC/BIC

We will start with our main explanatory variable–Legendary (False or True)–and add features to try to best explain hitting points. We will then choose our working model based on the lowest AIC/BIC score.

reg1=zelig(HP~Legendary,model = "poisson",data=Pokemon,cite=F)
reg2=zelig(HP~Legendary+Attack,model = "poisson",data=Pokemon,cite=F)
reg3=zelig(HP~Legendary+Attack+Defense,model = "poisson",data=Pokemon,cite=F)
reg4=zelig(HP~Legendary+Attack+Defense+Speed,model = "poisson",data=Pokemon,cite=F)
htmlreg(list(reg1,reg2,reg3,reg4))
Statistical models
Model 1 Model 2 Model 3 Model 4
(Intercept) 4.21*** 3.89*** 3.86*** 3.86***
(0.00) (0.01) (0.01) (0.02)
LegendaryTRUE 0.32*** 0.15*** 0.15*** 0.15***
(0.01) (0.01) (0.01) (0.02)
Attack 0.00*** 0.00*** 0.00***
(0.00) (0.00) (0.00)
Defense 0.00*** 0.00***
(0.00) (0.00)
Speed -0.00
(0.00)
AIC 11319.91 10422.74 10404.06 10406.06
BIC 11329.28 10436.80 10422.80 10429.49
Log Likelihood -5657.95 -5208.37 -5198.03 -5198.03
Deviance 6506.50 5607.34 5586.66 5586.66
Num. obs. 800 800 800 800
p < 0.001, p < 0.01, p < 0.05

Based on both AIC and BIC, the lowest score for both is reg3, which we will use for our simulation and analysis!

Setting the counterfactuals, simulating our data, and plotting

Let us set our counterfactuals for Pokemon Legendary being false or true

reg3$setx(Legendary="FALSE")
reg3$setx1(Legendary="TRUE")

Simulating our Data:

reg3$sim()

Plotting our Data:

reg3$graph()

Here, we set up a first difference simulation for Legendary where all other features in regression 3 are set to their default values (i.e. mode, median, etc.). We set X to be Legendary that is false, and X1 to be Legendary that is true.

From looking at the visuals, we can see that Legendary being false has an expected value of scoring about 10.4 less hitting points than if Legendary being true. Breaking it down more, the Legendary being false has an expected value of scoring about 67.6 hitting points whereas the Legendary being false has an expected value of scoring about 78 hitting points.

reg3$get_qi(xvalue="x1",qi="fd")%>%
  data.frame()%>%
  summary()
##        .         
##  Min.   : 6.898  
##  1st Qu.: 9.949  
##  Median :10.726  
##  Mean   :10.759  
##  3rd Qu.:11.598  
##  Max.   :14.733