Simulation is a method of data analysis that allows the researcher to make predictions and quantify uncertainty without being a guru in mathematics.As indicated in a paper by Bennet Zelner from Duke University titled “Research Notes and Commentaries using Simulation to interpret results from Logit, Probit, and Other Nonlinear Models,” Simulation-based approach should be used since it favors direct numerical calculations. This mathematical approach is better since it is easy for readers without knowledge in multivariate calculus to understand.

Another reason to use simulation-based approaches for research can be seen in a science journal conduct at Harvard University, titled “Making the Most of Statistical Analyses: Improving Interpretations and Presentations.” According to this article, inferences are never certain so it is better to give a range of results that might occur. Also, this same paper goes on to explain that simulation-based approaches are more superior when compared to more traditional methods of data interpretations because it allows the researcher to extract new insights which are more challenging to do in other forms data analysis.

Zelig is a package in the R programming language that allows the user to predict, understand and present their findings in any statistical method. Zelig can be used for Simulation-based approaches of the data analysis. Recently, Zelig 5.0 was created and it differs in its predecessor (Zelig 4) in some ways. For instance, the syntax of Zelig 5 is different. The changes in syntax allow researchers to wrangle and present their findings more easily. Also, Zelig 5 emulates languages like that of Java and C# in that it uses reference classes (RC). RC’s acts as a mutable object. Having learned about the benefits of simulation-based approaches and combing this form of data analysis advantages with Zelig 5,the first part of this project will aim to discover insights in the builtin titanic dataset.

#install.packages("radiant.data")
#install.packages('snakecase')

Re-implement the computations described on pages 5.2-5.18 using the Zelig 5 syntax.

library(snakecase)
library(radiant.data)
data("titanic")
titanic=titanic %>%
  mutate(survived = as.integer(survived)) %>% 
  mutate(survival = sjmisc::rec(survived, rec = "2=0; 1=1")) %>% 
  select(survived, survival, pclass, fare, everything())

Logit Model to Predict survival rates.

Slide 5.2

library(Zelig)
Zelig5=zlogit$new()
Zelig5$zelig(survival ~ age + fare + sex * pclass, data = titanic)
summary(Zelig5)

## Model: 
## 
## Call:
## Zelig5$zelig(formula = survival ~ age + fare + sex * pclass, 
##     data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0634  -0.6641  -0.4943   0.4336   2.4941  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        4.8978215  0.6131092   7.988 1.37e-15
## age               -0.0387245  0.0068044  -5.691 1.26e-08
## fare              -0.0009058  0.0020509  -0.442  0.65874
## sexmale           -3.8996177  0.5015659  -7.775 7.55e-15
## pclass2nd         -1.5923247  0.6024844  -2.643  0.00822
## pclass3rd         -4.1382715  0.5601819  -7.387 1.50e-13
## sexmale:pclass2nd -0.0600076  0.6372949  -0.094  0.92498
## sexmale:pclass3rd  2.5019110  0.5479901   4.566 4.98e-06
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1409.99  on 1042  degrees of freedom
## Residual deviance:  931.45  on 1035  degrees of freedom
## AIC: 947.45
## 
## Number of Fisher Scoring iterations: 5
## 
## Next step: Use 'setx' method

The Model above is using logistic regression to discover insight for the factors that influenced the survival rate in the Titanic. To do this, we will use the Zelig package and Zelig 5 syntax. The dependent variable is the survival variable. The independent variables are age,fare sex and pclass. The pclass variable indicates if the passager had a 1st class ticket, 2nd class ticket, and 3rd class ticket.

As seen above, the logistic regression model generated results are may be difficult to interpret. With this notion in mind, we the R chucks below will use a simulation-based approach to create more comprehensive findings.

Simulation-based approach using fare and survival rates.

Slide 5.3 - 5.4

fare.range = min(titanic$fare):max(titanic$fare)
Zelig5$setrange(fare=fare.range)
Zelig5$sim()
Zelig5$graph()

As you can see in the table above, the fare for a ticket was in the range of 0 - 500 units of money. Interestingly enough, the amount of money a person spent on their ticket did not increase their chances of survival. People who got on the ship for free(i.e., ship maintenance workers and pilots) had around the same chances of living that night as people who spent a lot of money on their ticket.

Simulation-based approach using age and survival rates.

Slide 5.5 - 5.6

age.range = min(titanic$age):max(titanic$age)
Zelig5$setrange(age=age.range)
Zelig5$sim()
Zelig5$graph()

The table above displays exciting findings. For instance, it would appear that age was a good predictor of survival. The younger a person was, the better chances that had at living. The older you were, the odds of you living decreased. This intuitively makes sense because babies and young children were possibly the first to board the lifeboats and get out of the ship in time.

Slide 5.7 - 5.9

Zelig5=zlogit$new()
Zelig5$zelig(survival ~ age + fare + sex * pclass, data = titanic)
Zelig5$setx(sex="male")
Zelig5$setx1(sex="female")
Zelig5$sim()
Zelig5$graph()

Slide 5.11 - 5.18

Zelig5=zlogit$new()
Zelig5$zelig(survival ~ age + sex * pclass + fare, data = titanic)
Zelig5$setx(sex="male",pclass="1st")
Zelig5$setx1(sex="female",pclass="1st")
Zelig5$sim()
Zelig5$graph()

Zelig5$setx(sex="male",pclass="2nd")
Zelig5$setx1(sex="female",pclass="2nd")
Zelig5$sim()
Zelig5$graph()

Zelig5$setx(sex="male",pclass="3rd")
Zelig5$setx1(sex="female",pclass="3rd")
Zelig5$sim()
Zelig5$graph()

z_tit <- zelig(survival ~ age + sex*pclass + fare, model = "logit", data = titanic, cite = F)
c1x <- setx(z_tit, sex = "male", pclass = "1st")
c1x1 <- setx(z_tit, sex = "female", pclass = "1st")
c1s <- sim(z_tit, x = c1x, x1 = c1x1)
c2x <- setx(z_tit, sex = "male", pclass = "2nd")
c2x1 <- setx(z_tit, sex = "female", pclass = "2nd")
c2s <- sim(z_tit, x = c2x, x1 = c2x1)
c3x <- setx(z_tit, sex = "male", pclass = "3rd")
c3x1 <- setx(z_tit, sex = "female", pclass = "3rd")
c3s <- sim(z_tit, x = c3x, x1 = c3x1)

d1 <- c1s$get_qi(xvalue="x1", qi="fd")
d2 <- c2s$get_qi(xvalue="x1", qi="fd")
d3 <- c3s$get_qi(xvalue="x1", qi="fd")

dfd <- as.data.frame(cbind(d1, d2, d3))
tidd <- dfd %>% 
  gather(class, simv, 1:3)
tidd %>% 
  group_by(class) %>% 
  summarise(mean = mean(simv), sd = sd(simv))

## # A tibble: 3 x 3
##   class  mean     sd
##   <chr> <dbl>  <dbl>
## 1 V1    0.520 0.0518
## 2 V2    0.748 0.0428
## 3 V3    0.254 0.0458

ggplot(tidd, aes(simv)) + geom_histogram() + facet_grid(~class)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

All of the above R chunks were designed to study the effects of sex and passengers class on surviving the Titanic crashing. Based on the data, it would appear that sex does play a role in surviving because females had higher chances of living then males. Also, the higher the passengers class was, the more likely there were to survive.

Simulation-based approach to study Pokémon.

I want to be the very best, the best that no one ever was. Any person born in the 1990s or who happens to be a fan of Japanese anime, would know that the previous line is the opening to the show known as Pokemon. However, for all of those individuals that are not familiar with the Pokeverse, Pokemon the tv show was based on a popular video game on the Nintendo Gameboy.

In this game, the player would control an avatar and try to catch and train creatures with superpowers known as pocket monsters or Pokémon for short. The whole purpose of the pokemon game was to have the strongest monsters and win the champion tournament.

Pokémon have something known as statistics or stats for short. PokeStats is a digital number that determines how strong a Pokémon is in the game. There are dozens of Pokestats in the game. However, the most critical Pokestats are Attack (how much damage a Pokémon deals) and Defense (how much damage a Pokémon can receive.). Generally, if your pokemon has stronger Attack and Defense PokeStats, you will win the match.

It must not go unmentioned that the Pokeverse has a small group of Pokémon known as Legendary Pokémon. Pocket Monsters that are classified as Legendary are rare mythical Pokémon that are assumed to be more powerful in the game. However, we do not know how much stronger a Legendary Pokémon is when compared to other pocket monsters. With this in mind, this analysis will use a simulation-based analysis to examine if Legendary Pokémon are stronger then non-Legendary pocket monsters. To do this I will use selected variables to see how Type, Attack, and Defense influences a Pokemon Legendary Status.

To conduct this investigation, this analysis will use the “Pokémon” dataset found on Kaggle. This is a public dataset open to anyone with a Kaggle account. This dataset has all of the Pokestats for every Pokémon in the Pokeverse. My dependent variable is Legendary_Status and my Independent variables are Type, Attack, and Defense.

library(readr)
Pokemon <- read_csv("Desktop/Pokemon.csv")

library(dplyr)
library(sjmisc)

## 
## Attaching package: 'sjmisc'

## The following objects are masked from 'package:radiant.data':
## 
##     center, is_empty

## The following object is masked from 'package:tidyr':
## 
##     replace_na

New_Pokemon_Data <- Pokemon %>%
select(Type1, Attack, Defense, Legendary) %>%
mutate(Legendary_Status = rec(Legendary, rec = "FALSE=0 ; TRUE=1 "),
       Attack = as.integer(Attack),
       Defense = as.integer(Defense),
       Type1 = factor(Type1))

Linear_Model <- lm(Attack ~ Defense,data = New_Pokemon_Data)
summary(Linear_Model)

## 
## Call:
## lm(formula = Attack ~ Defense, data = New_Pokemon_Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -140.304  -19.203   -4.039   16.234  125.584 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 45.28420    2.65383   17.06   <2e-16 ***
## Defense      0.45661    0.03311   13.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.19 on 798 degrees of freedom
## Multiple R-squared:  0.1924, Adjusted R-squared:  0.1914 
## F-statistic: 190.2 on 1 and 798 DF,  p-value: < 2.2e-16

library(visreg)
visreg(Linear_Model,"Defense", scale = "response")

In the Pokeverse, it is a well-known fact that the more damage a pokemon and deal, typically, the more damage they can receive. This simple linear model and graph depict this common assumption. However, this model does not describe how Pokestats impacts Legendary status. To understand the effects of Pokestats on Legendary status, we will use the Zelig package.

library(Zelig)
Pokemon.z1 <- zelig(Legendary_Status ~ Attack+ Defense+ Type1, data = New_Pokemon_Data, model = "logit", cite = F)
summary(Pokemon.z1)

## Model: 
## 
## Call:
## z5$zelig(formula = Legendary_Status ~ Attack + Defense + Type1, 
##     data = New_Pokemon_Data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.14898  -0.31089  -0.16275  -0.04663   2.64503  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -2.517e+01  1.071e+03  -0.023    0.981
## Attack         3.555e-02  5.436e-03   6.538 6.22e-11
## Defense        2.238e-02  5.548e-03   4.034 5.49e-05
## Type1Dark      1.727e+01  1.071e+03   0.016    0.987
## Type1Dragon    1.846e+01  1.071e+03   0.017    0.986
## Type1Electric  1.844e+01  1.071e+03   0.017    0.986
## Type1Fairy     1.794e+01  1.071e+03   0.017    0.987
## Type1Fighting  1.697e-01  2.201e+03   0.000    1.000
## Type1Fire      1.777e+01  1.071e+03   0.017    0.987
## Type1Flying    2.071e+01  1.071e+03   0.019    0.985
## Type1Ghost     1.718e+01  1.071e+03   0.016    0.987
## Type1Grass     1.716e+01  1.071e+03   0.016    0.987
## Type1Ground    1.701e+01  1.071e+03   0.016    0.987
## Type1Ice       1.769e+01  1.071e+03   0.017    0.987
## Type1Normal    1.624e+01  1.071e+03   0.015    0.988
## Type1Poison    1.074e+00  2.212e+03   0.000    1.000
## Type1Psychic   1.938e+01  1.071e+03   0.018    0.986
## Type1Rock      1.656e+01  1.071e+03   0.015    0.988
## Type1Steel     1.649e+01  1.071e+03   0.015    0.988
## Type1Water     1.667e+01  1.071e+03   0.016    0.988
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 450.90  on 799  degrees of freedom
## Residual deviance: 282.68  on 780  degrees of freedom
## AIC: 322.68
## 
## Number of Fisher Scoring iterations: 18
## 
## Next step: Use 'setx' method

The model depicts exciting findings on how PokeStats impacts status . For instance, for every unit of measurement a Pokémon gets closer to Legendary status, the stronger attack and defense a Pokémon will get. However, I must mention that a Pokémon’s type plays a critical role in how strong the Pokémon will become. As Pokémon gets closer to being a Legendary status since different types increase their power levels at different rates.

Pokémon types are properties and attributes related to Pokémon and their moves it can use to deal damage while in fighting another Pokémon. Typically if a Pokémon lives in a body of water, it is grouped as a water type. If a Pokémon can fly, it is then classified as a flying type. As of the day this investigation was completed, there are 18 types and 812 known Pocket Monsters.

Simulation-based approach using Attack and Legendary status.

Attack.range <- min(New_Pokemon_Data$Attack):max(New_Pokemon_Data$Attack)
Pokemon.z1$setrange(Attack=Attack.range)
Pokemon.z1$sim()
Pokemon.z1$graph()

The graphs above do depict the trend discovered in the previous R chunk. A Pokémon’s Legendary status does impact their attack statistic. The closer the Pokemon get to Legendary status, the stronger, the more damage they can deal.

Simulation-based approach using Defense and Legendary status.

Defense.range <- min(New_Pokemon_Data$Defense):max(New_Pokemon_Data$Defense)
Pokemon.z1$setrange(Defense=Defense.range)
Pokemon.z1$sim()
Pokemon.z1$graph()

The R chunk supports the claim that Legendary Pokémon can also receive more damage then Non-Legendary Pokémon. The closer a Pokémon gets to Legendary status, the higher their defense statistic becomes.

Visreg package to visualize the impacts of Legendary on Pokestats.

To get a visual on how Lendrary status impacts Pokestats by type, I will use the visreg package.

Status_Model <- glm(Legendary_Status ~ Defense + Attack*Type1, family = "binomial", data = New_Pokemon_Data)
visreg(Status_Model,"Attack", by = "Type1", scale = "response")

visreg(Status_Model,"Defense", by = "Type1", scale = "response")

As seen in the table, a Pokémon’s type does influence how much damage it can deal and receive. I would say that some of the strongest hitting Pokémon are Psychic, Flying, Electric and Fairy types. So of the types that can take the most damage are Psychic and Dragon. Please note that Legendary status is taken into consideration for models. Therefore, the stronger a Pokémon is, the more likely they are to be a Legendary Pocket Monster.

Concluding remarks

In this project, we learned to use Zelig 5 syntax to use a simulation-based approach of analysis to study what factors impact the survival rate of the Titanic ship sinking. We also used this same method of analysis to examine the factors that lead to a Pokémon being legendary or not. We learned that sex was a good indicator that predicted if a person would survive the sinking of the Titanic. We also learned that legendary Pokémon are stronger then non-legendary and that certain types of legendary Pokémon are stronger than other types of mythical Pocket Monsters.

HW6_Part1 and 2

Ariel Rosario Jr

3/21/2019