Week 10

Author: Navdeep Metchu

Date: 2023-10-25

# Read the dataset
pokemon_data <- read.csv("PokemonStats.csv")

Selecting an Interesting Binary Column

Binary Variable: “Is its total stat above 500?”

Yes (1)

No (0)

# Creating a new binary column 'HighTotalStat'
pokemon_data$HighTotalStat <- as.integer(pokemon_data$Total > 500)

# Displaying the first few rows of the dataset with the new column
head(pokemon_data[, c("Name", "Total", "HighTotalStat")])
##                     Name Total HighTotalStat
## 1              Bulbasaur   318             0
## 2                Ivysaur   405             0
## 3               Venusaur   525             1
## 4 Venusaur Mega Venusaur   625             1
## 5             Charmander   309             0
## 6             Charmeleon   405             0

We’ve created the binary variable HighTotalStat that indicates whether a Pokémon has a total stat above 500 (represented by 1) or not (represented by 0).
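
As a quick sanity check on the new column, it can help to count how many rows fall into each class. The sketch below rebuilds only the six rows printed above in a small data frame (named `preview` here purely for illustration) instead of reading the full file, so the counts describe those six rows and not the whole dataset.

```r
# Rebuild the six rows shown in the head() output above (illustration only)
preview <- data.frame(
  Name = c("Bulbasaur", "Ivysaur", "Venusaur", "Venusaur Mega Venusaur",
           "Charmander", "Charmeleon"),
  Total = c(318, 405, 525, 625, 309, 405)
)

# Same rule as in the report: 1 if Total is above 500, otherwise 0
preview$HighTotalStat <- as.integer(preview$Total > 500)

# Count of rows per class, and the share of rows coded 1
class_counts <- table(preview$HighTotalStat)
share_high <- mean(preview$HighTotalStat)

print(class_counts)
print(share_high)
```

Run on pokemon_data$HighTotalStat, the same two calls would show whether the two classes are badly unbalanced before any model is fitted.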

Building a Logistic Regression Model

For the logistic regression model, the dependent variable will be HighTotalStat, and we’ll choose 4 explanatory variables.
To decide which explanatory variables are most relevant, we’ll check the correlation of each candidate variable with our target variable, HighTotalStat.
# Calculate correlation of potential explanatory variables with 'HighTotalStat'
correlations <- cor(pokemon_data[, c('HP', 'Attack', 'Defense', 'SpAtk', 'SpDef', 'Speed', 'Height', 'Weight', 'HighTotalStat')])['HighTotalStat',]

# Remove the correlation of 'HighTotalStat' with itself
correlations <- correlations[names(correlations) != "HighTotalStat"]

# Sort correlations in descending order
sorted_correlations <- sort(correlations, decreasing=TRUE)

sorted_correlations
##     SpAtk    Attack     SpDef        HP   Defense     Speed    Height 
## 0.5825124 0.5254463 0.5151532 0.4693824 0.4465165 0.4003035 0.2315656
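
Weight was passed to cor() but does not appear in the sorted output. One explanation consistent with this, though the output alone does not confirm it, is that Weight contains missing values: by default cor() returns NA for any pair involving such a column, and sort() silently drops NA entries. The sketch below reproduces that effect on a small made-up data frame (`toy` and all its values are hypothetical) and shows `use = "pairwise.complete.obs"` as one possible way to keep the variable in the comparison.

```r
# Hypothetical data: Weight has one missing value, the other columns are complete
toy <- data.frame(
  HP = c(45, 60, 80, 80, 39, 58),
  Weight = c(6.9, 13, 100, NA, 8.5, 19),
  HighTotalStat = c(0, 0, 1, 1, 0, 0)
)

# Default behaviour: any pair involving a column with NA gets an NA correlation
default_cor <- cor(toy)["HighTotalStat", ]

# sort() removes NA entries by default, so Weight disappears from the result
sorted_default <- sort(default_cor, decreasing = TRUE)

# Pairwise deletion: each correlation uses the rows complete for that pair
pairwise_cor <- cor(toy, use = "pairwise.complete.obs")["HighTotalStat", ]

print(names(sorted_default))
print(pairwise_cor)
```

If this is what happened here, Weight was never actually compared with the other candidates, so the ranking below covers seven variables, not eight.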

Based on the correlation values, the variables that have the highest correlation with our target variable HighTotalStat are:

SpAtk with a correlation of approximately 0.583
Attack with a correlation of approximately 0.525
SpDef with a correlation of approximately 0.515
HP (Hit Points) with a correlation of approximately 0.469

We’ll choose these 4 variables and build the logistic regression model with them.

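# X is kept for reference only; glm() reads the predictors from pokemon_data
X <- pokemon_data[, c('SpAtk', 'Attack', 'SpDef', 'HP')]
# y is not a column of pokemon_data, so glm() finds it in the formula's environment
y <- pokemon_data$HighTotalStat
logit_model <- glm(y ~ SpAtk + Attack + SpDef + HP, data=pokemon_data, family=binomial(link="logit"))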

summary(logit_model)
## 
## Call:
## glm(formula = y ~ SpAtk + Attack + SpDef + HP, family = binomial(link = "logit"), 
##     data = pokemon_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -21.587125   1.557261 -13.862  < 2e-16 ***
## SpAtk         0.063047   0.005516  11.430  < 2e-16 ***
## Attack        0.075208   0.006561  11.463  < 2e-16 ***
## SpDef         0.070693   0.007011  10.083  < 2e-16 ***
## HP            0.038298   0.005336   7.177 7.13e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1468.44  on 1193  degrees of freedom
## Residual deviance:  504.17  on 1189  degrees of freedom
## AIC: 514.17
## 
## Number of Fisher Scoring iterations: 8

1. Coefficients:

The constant (or intercept) is −21.5871.
The coefficient for SpAtk is 0.0630.
The coefficient for Attack is 0.0752.
The coefficient for SpDef is 0.0707.
The coefficient for HP is 0.0383.
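
To see what these coefficients imply for a single Pokémon, the linear predictor can be converted into a probability with the logistic function. The sketch below copies the estimates from the summary output; the stat values (SpAtk 100, Attack 82, SpDef 100, HP 80) are hypothetical inputs chosen for illustration, not a row taken from the dataset.

```r
# Estimates copied from the summary(logit_model) output above
intercept <- -21.587125
coefs <- c(SpAtk = 0.063047, Attack = 0.075208, SpDef = 0.070693, HP = 0.038298)

# Hypothetical stat values for one Pokémon (assumption, not from the data)
stats <- c(SpAtk = 100, Attack = 82, SpDef = 100, HP = 80)

# Linear predictor on the log-odds scale
eta <- intercept + sum(coefs * stats[names(coefs)])

# plogis() is the inverse logit: it maps log odds to a probability
prob_high <- plogis(eta)

print(eta)
print(prob_high)
```

With the fitted model available, predict(logit_model, newdata, type="response") should return the same kind of probability without copying coefficients by hand.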

2. Interpretation:

Constant (−21.5871): This represents the log odds of a Pokémon having a total stat above 500 when all the explanatory variables are 0. Given that it’s not realistic for these stats to be zero, this value is mainly theoretical.
SpAtk: For a unit increase in the SpAtk value, the log odds of the Pokémon having a total stat greater than 500 increase by 0.0630, keeping all other variables constant.
Attack: For a unit increase in the Attack value, the log odds of the Pokémon having a total stat greater than 500 increase by 0.0752, keeping all other variables constant.
SpDef: For a unit increase in the SpDef value, the log odds of the Pokémon having a total stat greater than 500 increase by 0.0707, keeping all other variables constant.
HP: For a unit increase in the HP value, the log odds of the Pokémon having a total stat greater than 500 increase by 0.0383, keeping all other variables constant.
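
Log odds are hard to read directly, so a common companion step is to exponentiate the coefficients into odds ratios. The sketch below does this with the estimates copied from the summary output; the 10-point step is an arbitrary choice made here for illustration.

```r
# Estimates copied from the summary(logit_model) output above
coefs <- c(SpAtk = 0.063047, Attack = 0.075208, SpDef = 0.070693, HP = 0.038298)

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
odds_ratios <- exp(coefs)

# Multiplicative change in the odds for a 10-point increase in a stat
odds_ratios_10 <- exp(10 * coefs)

print(round(odds_ratios, 3))
print(round(odds_ratios_10, 3))
```

Each one-point odds ratio lies between roughly 1.04 and 1.08, meaning one extra point in a stat multiplies the odds of a total above 500 by about 4% to 8%, holding the other three stats constant.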

3. Significance:

All the explanatory variables have a p-value less than 0.05, which suggests that they are statistically significant predictors of whether a Pokémon has a total stat greater than 500.
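
Beyond the individual p-values, the deviances in the summary output allow a rough check of the model as a whole. The sketch below runs a likelihood ratio test against the intercept-only model and computes McFadden's pseudo R-squared, using only the numbers printed above; both are added here as illustrations and were not part of the original analysis.

```r
# Deviances and degrees of freedom copied from the summary output above
null_deviance <- 1468.44
null_df <- 1193
residual_deviance <- 504.17
residual_df <- 1189

# Likelihood ratio statistic: the drop in deviance from adding the 4 predictors
lr_stat <- null_deviance - residual_deviance
lr_df <- null_df - residual_df

# Upper-tail chi-squared probability of a drop at least this large by chance
lr_p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R-squared: 1 minus the ratio of residual to null deviance
mcfadden_r2 <- 1 - residual_deviance / null_deviance

print(c(lr_stat = lr_stat, lr_df = lr_df, lr_p_value = lr_p_value))
print(round(mcfadden_r2, 3))
```

The deviance falls by 964.27 on 4 degrees of freedom, and the pseudo R-squared is about 0.66, which is consistent with the four predictors jointly explaining much of the variation in HighTotalStat.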
# ggplot2 provides ggplot(), geom_boxplot() and the theme functions used below
library(ggplot2)

# List of explanatory variables
explanatory_vars <- c('SpAtk', 'Attack', 'SpDef', 'HP')

# Draw one box plot per explanatory variable, split by HighTotalStat
for (var in explanatory_vars) {
  # .data[[var]] looks up the column named by the string in var
  p <- ggplot(pokemon_data, aes(x=factor(HighTotalStat), y=.data[[var]], fill=factor(HighTotalStat))) +
    geom_boxplot() +
    scale_fill_manual(values=c("0"="#98A886", "1"="#735290"), name="HighTotalStat", breaks=c("0", "1"), labels=c("No", "Yes")) +
    labs(title=paste('Box plot of', var, 'vs. HighTotalStat'),
         x='HighTotalStat', y=var) +
    theme_minimal() +
    theme(
      legend.position="top",
      panel.background = element_rect(fill="lightgray", colour="black", linewidth=1),
      panel.grid.major = element_line(colour = "white"),
      panel.grid.minor = element_line(colour = "white", linetype = "dashed")
    )
  print(p)
}

The box plots above illustrate the distribution of our selected explanatory variables (SpAtk, Attack, SpDef, and HP) across the two categories of the binary target variable HighTotalStat.

From the box plots:

Distribution Disparity: The median values of SpAtk, Attack, SpDef, and HP are noticeably higher for Pokémon with a HighTotalStat of 1 (indicating a total stat above 500) compared to those with a HighTotalStat of 0.
Spread & Outliers: The spread (interquartile range) and overall distribution of these variables are visibly different between the two categories. There are also some outliers, especially for Pokémon with a HighTotalStat of 0.
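
The visual comparison can be backed by numbers by computing the median and interquartile range of each variable within each class. The sketch below shows the idea on a small made-up data frame (`toy` and its values are hypothetical and are not taken from the dataset).

```r
# Hypothetical values for one stat, four Pokémon per class
toy <- data.frame(
  SpAtk = c(40, 55, 65, 80, 95, 110, 120, 135),
  HighTotalStat = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Median and interquartile range of SpAtk within each HighTotalStat class
group_medians <- tapply(toy$SpAtk, toy$HighTotalStat, median)
group_iqr <- tapply(toy$SpAtk, toy$HighTotalStat, IQR)

print(group_medians)
print(group_iqr)
```

Applied to pokemon_data, the same tapply() calls for SpAtk, Attack, SpDef, and HP would put figures on the differences described above.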

Consideration for Transformation:

If we were to consider transformations:

Log transformation: Useful if the relationship between the dependent and independent variables is multiplicative in nature.
Polynomial transformation: Useful if the relationship seems to be polynomial (quadratic, cubic, etc.).
However, the box plots give no strong indication that a transformation is needed. Each variable is clearly higher in the HighTotalStat = 1 group than in the HighTotalStat = 0 group, which suggests the original variables are suitable for logistic regression as they are. This is a visual judgment only, since box plots cannot confirm that the relationship with the log odds is linear.
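
One way to examine the transformation question more formally is to fit the untransformed, log, and quadratic versions of a predictor and compare them by AIC. The sketch below does this on simulated data, because the dataset is not reproduced here; the data frame `sim`, the seed, and the coefficients used to generate it are all assumptions made for illustration.

```r
# Simulated data (assumption): one predictor with a linear effect on the log odds
set.seed(1)
n <- 300
sim <- data.frame(SpAtk = round(runif(n, 20, 150)))
sim$HighTotalStat <- rbinom(n, 1, plogis(-6 + 0.07 * sim$SpAtk))

# Three candidate forms of the same predictor
fit_linear <- glm(HighTotalStat ~ SpAtk, data = sim, family = binomial)
fit_log <- glm(HighTotalStat ~ log(SpAtk), data = sim, family = binomial)
fit_quad <- glm(HighTotalStat ~ poly(SpAtk, 2), data = sim, family = binomial)

# Lower AIC indicates a better trade-off between fit and complexity
aic_table <- AIC(fit_linear, fit_log, fit_quad)
print(aic_table)
```

On the real data, a transformed model that does not clearly beat the untransformed one on AIC would support keeping the variables as they are.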

Insights & Further Investigation:

Insight: Pokémon with higher values in SpAtk, Attack, SpDef, and HP are more likely to have a total stat above 500.
Significance: Understanding which stats are significant predictors can help in predicting or categorizing Pokémon based on their overall prowess.

Further Questions:

1. Are there interactions between these stats that further influence the total stat value?
2. Would considering other stats or attributes (like Type1 or Type2) provide more insight or improve the model’s predictive power?
3. How do legendary Pokémon compare to regular Pokémon in terms of these stats?
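
The first question could be approached by adding an interaction term and comparing the two nested models with a likelihood ratio test. The sketch below shows the mechanics on simulated data; `sim`, the seed, and the generating coefficients are assumptions, and the simulated outcome contains no true interaction, so it illustrates only the workflow and not any expected result for the Pokémon data.

```r
# Simulated data (assumption): two predictors with additive effects only
set.seed(42)
n <- 400
sim <- data.frame(
  SpAtk = round(runif(n, 20, 150)),
  Attack = round(runif(n, 20, 150))
)
sim$HighTotalStat <- rbinom(n, 1, plogis(-9 + 0.05 * sim$SpAtk + 0.05 * sim$Attack))

# Main-effects model, and the same model plus the SpAtk:Attack interaction
fit_main <- glm(HighTotalStat ~ SpAtk + Attack, data = sim, family = binomial)
fit_inter <- glm(HighTotalStat ~ SpAtk * Attack, data = sim, family = binomial)

# Likelihood ratio test of whether the interaction term improves the fit
comparison <- anova(fit_main, fit_inter, test = "Chisq")
print(comparison)
```

The same pattern, with HighTotalStat ~ SpAtk * Attack + SpDef + HP fitted on pokemon_data, would be one way to begin answering the question for the real data.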