Pokémon Regression Modelling

1 Introduction

This study aims to explore how speed points vary depending on the Pokémon’s primary type (type_1), secondary type (type_2), and weight. The report is divided into two sections: (1) Speed points variation in association with Pokémon’s primary type, (2) Speed points variation in association with Pokémon’s both types after controlling for the weight. Each of the sections consists of the following items.

2 Speed Points Variation in Association with Pokémon Primary Type

2.1 Exploratory Data Analysis

Before doing any visualisation or hypothesis testing, I’d take a look at the data structure: what covariates are present, the number of Pokémon types, and missing values if any. This information will enable me to understand the Pokémon type distribution, whether they are balanced or not.

library(pokemon)
pokemon$type_1 <- as.factor(pokemon$type_1)
pokemon$type_2 <- as.factor(pokemon$type_2)
head(pokemon)
  id    pokemon species_id height weight base_experience type_1 type_2 hp
1  1  bulbasaur          1    0.7    6.9              64  grass poison 45
2  2    ivysaur          2    1.0   13.0             142  grass poison 60
3  3   venusaur          3    2.0  100.0             236  grass poison 80
4  4 charmander          4    0.6    8.5              62   fire   <NA> 39
5  5 charmeleon          5    1.1   19.0             142   fire   <NA> 58
6  6  charizard          6    1.7   90.5             240   fire flying 78
  attack defense special_attack special_defense speed color_1 color_2 color_f
1     49      49             65              65    45 #78C850 #A040A0 #81A763
2     62      63             80              80    60 #78C850 #A040A0 #81A763
3     82      83            100             100    80 #78C850 #A040A0 #81A763
4     52      43             60              50    65 #F08030    <NA>    <NA>
5     64      58             80              65    80 #F08030    <NA>    <NA>
6     84      78            109              85   100 #F08030 #A890F0 #DE835E
  egg_group_1 egg_group_2
1     monster       plant
2     monster       plant
3     monster       plant
4     monster      dragon
5     monster      dragon
6     monster      dragon
                                                 url_icon generation_id
1 //archives.bulbagarden.net/media/upload/7/7b/001MS6.png             1
2 //archives.bulbagarden.net/media/upload/a/a0/002MS6.png             1
3 //archives.bulbagarden.net/media/upload/0/07/003MS6.png             1
4 //archives.bulbagarden.net/media/upload/7/7d/004MS6.png             1
5 //archives.bulbagarden.net/media/upload/b/be/005MS6.png             1
6 //archives.bulbagarden.net/media/upload/4/41/006MS6.png             1
                                                                            url_image
1 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/001.png
2 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/002.png
3 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/003.png
4 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/004.png
5 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/005.png
6 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/006.png
rows_amt <- nrow(pokemon)
print(paste("The number of Pokémon in the dataset:", rows_amt))
[1] "The number of Pokémon in the dataset: 949"
# Check the structure
str(pokemon$type_1)
 Factor w/ 18 levels "bug","dark","dragon",..: 10 10 10 7 7 7 18 18 18 1 ...
# 1. Count NAs
na_type_1 <- sum(is.na(pokemon$type_1))
print(paste("NAs count for type_1: ", na_type_1))
[1] "NAs count for type_1:  0"
# 2. See the breakdown of all levels (NA would appear separately if present)
summary(pokemon$type_1)
     bug     dark   dragon electric    fairy fighting     fire   flying 
      79       37       39       61       19       31       59        4 
   ghost    grass   ground      ice   normal   poison  psychic     rock 
      40       84       36       29      111       35       64       65 
   steel    water 
      30      126 
# 3. Proportion of NAs
na_prop_type_1 <- mean(is.na(pokemon$type_1))
print(paste("Proportion of NAs in type_1: ", na_prop_type_1))
[1] "Proportion of NAs in type_1:  0"
# Check the structure
str(pokemon$type_2)
 Factor w/ 18 levels "bug","dark","dragon",..: 14 14 14 NA NA 8 NA NA NA NA ...
# 1. Count NAs
na_type_2 <- sum(is.na(pokemon$type_2))
print(paste("NAs count for type_2: ", na_type_2))
[1] "NAs count for type_2:  439"
# 2. See the breakdown of all levels (NA would appear separately if present)
summary(pokemon$type_2)
     bug     dark   dragon electric    fairy fighting     fire   flying 
       5       25       22       12       39       33       15      121 
   ghost    grass   ground      ice   normal   poison  psychic     rock 
      18       26       40       15        7       35       36       14 
   steel    water     NA's 
      29       18      439 
# 3. Proportion of NAs
na_prop_type_2 <- mean(is.na(pokemon$type_2))
print(paste("Proportion of NAs in type_2: ", na_prop_type_2))
[1] "Proportion of NAs in type_2:  0.46259220231823"

A total of 18 Pokémon types are observed in the dataset, with no missing values in the type_1 covariate The distribution of Pokémon types is notably imbalanced. Among the 949 recorded Pokémon, only four are classified as Flying-type, whereas Water-type Pokémon account for as many as 126 observations.

On the other hand, there are also a total of 18 Pokémon types present in the type_2 with 439 missing values. However, I assume that NA here represents for that Pokémon instead of missing data. Furthermore, the distribution of the factors in the type_2 is also imbalanced with only five are Bug-type, but as many as 121 Pokémon are Flying-type.

In the next step, I plot speed points of 18 primary types (type_1) to see if there is any speed variation among existing types. The plot below displays speed variation according to the type, with dragon, electric, fairy, flying, grass, psychic, and steel noticeably stand out from the other types.

library(tidyverse)
ggplot(pokemon, aes(x = type_1, y = speed)) +
  geom_boxplot() +
  geom_point() +
  labs(title = "Pokémon primary type vs speed", x = "Primary Type", y = "Speed") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The presence of outliers in several types, such as psychic, bug, electric, and rock, caught my attention. Further research into the Pokémon community shows that there is another category of Pokémon called . However, this is not an official category, such as those in type_1 and type_2, but they are spread across every type. Legendary Pokémon are powerful and have higher power stats compared to the normal ones (https://bulbapedia.bulbagarden.net/wiki/Legendary_Pok%C3%A9mon). Hence, I suspect that Pokémon types with a high number of Legendary Pokémon, affect the type’s speed average.

To affirm the theory, the Pokémons are classified into two new categories (the new classification is named legendary_group): (1) primary types with a high number of Legendary Pokémon (called ): psychic, dragon, flying, and fire (https://thelostlambda.github.io/pokestats/); (2) primary types with a low number of Legendary Pokémon (called ): consists of other types not yet mentioned. As seen in the plot below, . Later on, this will be tested with ANOVA to see if either of the two levels’ speed is different from the overall average speed.

# Create a binary grouping variable
pokemon$legendary_group <- ifelse(pokemon$type_1 %in% c("psychic", "dragon", "flying", "fire"), "legendary", "non_legendary")

pokemon$legendary_group <- as.factor(pokemon$legendary_group)

ggplot(pokemon, aes(x = legendary_group, y = speed)) +
  geom_boxplot() +
  geom_point() +
  labs(title = "Pokémon Legendary group vs speed", x = "Legendary group", y = "Speed") +
  theme_bw()

Furthermore, Pokémon stats distribution is closely related to its combat style, which is divided into three classes based on the Pokémon community: offensive, defensive, and balanced (https://www.reddit.com/r/TruePokemon/comments/fimlpi/offensive_vs_defensive_typings/). To investigate how speed points vary in association to the Pokémon’s combat style, Pokémons are classified into three levels: (1) offensive, covering fighting, ground, fire, ice, electric, and rock types; (2) defensive, covering steel, fairy, water, dragon, ghost, and poison; (3) balanced, covering normal, flying, psychic, dark, grass, and bug.

Pokémon classification vs speed plot shows that both balanced and defensive Pokémon have the same average speed, while offensive ones have a higher average speed. This signals that offensive-typed Pokémon has average speed that is different from the overall average speed. Further ANOVA test will be used to verify this hypothesis.

# Create offensive/defensive/balanced classification
pokemon$type_classification <- case_when(
  pokemon$type_1 %in% c("fighting", "ground", "fire", "ice", "electric", "rock") ~ "offensive",
  pokemon$type_1 %in% c("steel", "fairy", "water", "dragon", "ghost", "poison") ~ "defensive",
  pokemon$type_1 %in% c("normal", "flying", "psychic", "dark", "grass", "bug") ~ "balanced",
  TRUE ~ NA_character_
)

pokemon$type_classification <- as.factor(pokemon$type_classification)

# Check the classification
table(pokemon$type_classification)

 balanced defensive offensive 
      379       289       281 
# Visualisation of speed vs combat style
ggplot(pokemon, aes(x = type_classification, y = speed)) +
  geom_boxplot() +
  geom_point() +
  labs(title = "Pokémon classification vs speed", x = "Pokémon classification", y = "Speed") +
  theme_bw()

2.2 ANOVA Test

2.2.1 Testing if any type in type_1 is different from the overall average speed

Firstly, I employ ANOVA test to evaluate the significance of regression model of speed against type_1 using centered parametrisation approach (lm(formula = speed ~ type_1)):

\(H_0: \beta_1 = ... = \beta_{18} = 0\)

\(H_1: \text{at least one of } \beta_1, ..., \beta_{18} \text{ is not 0}\)

library(standardize)
# Set sum-to-zero contrasts on type_1
contrasts(pokemon$type_1) <- named_contr_sum(levels(pokemon$type_1))

# Fit the model
fit_speed_stz <- lm(speed ~ type_1, data = pokemon)

# Get summary
summary(fit_speed_stz)

Call:
lm(formula = speed ~ type_1, data = pokemon)

Residuals:
    Min      1Q  Median      3Q     Max 
-65.288 -20.722  -0.938  19.278  99.062 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     70.20605    1.25563  55.913  < 2e-16 ***
type_1bug       -6.88959    3.24363  -2.124  0.03393 *  
type_1dark       6.41557    4.54691   1.411  0.15859    
type_1dragon    12.66574    4.43791   2.854  0.00441 ** 
type_1electric  17.00706    3.62774   4.688 3.17e-06 ***
type_1fairy    -16.62710    6.22632  -2.670  0.00771 ** 
type_1fighting  -2.62541    4.93667  -0.532  0.59498    
type_1fire       3.26853    3.68147   0.888  0.37486    
type_1flying    32.29395   13.35033   2.419  0.01576 *  
type_1ghost     -4.55605    4.38658  -1.039  0.29924    
type_1grass     -9.92034    3.16049  -3.139  0.00175 ** 
type_1ground    -5.62272    4.60488  -1.221  0.22238    
type_1ice       -5.65433    5.09340  -1.110  0.26723    
type_1normal     0.08224    2.81825   0.029  0.97673    
type_1poison    -6.03462    4.66537  -1.293  0.19616    
type_1psychic   10.73145    3.55211   3.021  0.00259 ** 
type_1rock      -5.94451    3.52812  -1.685  0.09234 .  
type_1steel    -14.10605    5.01304  -2.814  0.00500 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.19 on 931 degrees of freedom
Multiple R-squared:  0.08497,   Adjusted R-squared:  0.06826 
F-statistic: 5.085 on 17 and 931 DF,  p-value: 9.783e-11
## Calculating type_1=water estimated coefficient and p-value
V <- vcov(fit_speed_stz)
coef_all <- coef(fit_speed_stz)
coef_factors <- coef_all[-1]  # Remove intercept

k <- length(coef_factors)
n <- nrow(pokemon)
p <- length(coef_all)
df <- n - p

beta_missing <- -sum(coef_factors) # type_1=water estimated coefficient

V_factors <- V[-1, -1]
c_contrast <- rep(1, k)
var_beta_missing <- as.numeric(t(c_contrast) %*% V_factors %*% c_contrast) # calculate variance of type_1=water 

sd_beta_missing <- sqrt(var_beta_missing) # calculate std deviation of type_1=water 

t_stat <- beta_missing / sd_beta_missing # calculate t-stat

p_value <- 2 * (1 - pt(abs(t_stat), df = df)) # calculate p-value

print(beta_missing)
[1] -4.483828
print(p_value)
[1] 0.09470183
fit_null <- lm(speed ~ 1, data = pokemon)
anova(fit_null, fit_speed_stz)
Analysis of Variance Table

Model 1: speed ~ 1
Model 2: speed ~ type_1
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    948 808815                                  
2    931 740094 17     68721 5.0851 9.783e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the one-way ANOVA test, \(H_0\) is rejected (p-value = 9.783e-11), and . Furthermore, as indicated in Pokémon primary type vs speed plot, the graph indicates that at least dragon, electric, fairy, flying, grass, psychic, and steel have different average speeds compared to the overall average. A closer look at the regression summary (using sum to zero approach) in the regerssion summary above, demonstrating that .

2.2.2 Testing if any of the types with high and low numbers of legendary Pokémon are different from the overall average speed

Further ANOVA test to evaluate the significance of regression model of speed against the legendary_group with the formula lm(formula = speed ~ legendary_group) using centered parametrisation approach:

\(H_0: \beta_1 = \beta_2 = 0\)

\(H_1: \text{at least one of } \beta_1, \beta_2 \text{ is not equal 0}\)

contrasts(pokemon$legendary_group) <- named_contr_sum(levels(pokemon$legendary_group)) # Set sum-to-zero contrasts on type_1
fit_legendary <- lm(speed ~ legendary_group, data = pokemon) # Fit model comparing the two groups
summary(fit_legendary)

Call:
lm(formula = speed ~ legendary_group, data = pokemon)

Residuals:
    Min      1Q  Median      3Q     Max 
-61.853 -21.853  -1.853  20.741 100.741 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)                73.056      1.232  59.289  < 2e-16 ***
legendary_grouplegendary    6.203      1.232   5.034 5.75e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.84 on 947 degrees of freedom
Multiple R-squared:  0.02606,   Adjusted R-squared:  0.02503 
F-statistic: 25.34 on 1 and 947 DF,  p-value: 5.748e-07
fit_null <- lm(speed ~ 1, data = pokemon)
anova(fit_null, fit_legendary) # ANOVA test to check if there is legendary group that is different from the overall average speed
Analysis of Variance Table

Model 1: speed ~ 1
Model 2: speed ~ legendary_group
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    948 808815                                  
2    947 787736  1     21080 25.341 5.748e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test produces p-value of 5.748e-07 (\(<\) 0.05). Hence, \(H_0\) is rejected, and it is concluded that at least one of the legendary_group is statistically significant in predicting the speed points. Furthermore, the regression summary above shows that the estimated coefficients of both legendary and non_legendary levels are significant with a p-value of 5.75e-07 (\(<\) 0.05), meaning that .

2.2.3 Testing if any of the offensive, defensive, and balanced Pokémon types are different from the overall average speed

ANOVA test is employed to test the significance of the regression model for speed points against the Pokémon’s combat style lm(formula = speed ~ type_classification) using centered parametrisation approach:

\(H_0: \beta_1 = ... = \beta_{3} = 0\)

\(H_1: \text{at least one of } \beta_1, ..., \beta_{3} \text{ is not 0}\)

# Set sum-to-zero contrasts on type_1
contrasts(pokemon$type_classification) <- named_contr_sum(levels(pokemon$type_classification))

# Fit model comparing the two groups
fit_classification <- lm(speed ~ type_classification, data = pokemon)
summary(fit_classification)

Call:
lm(formula = speed ~ type_classification, data = pokemon)

Residuals:
    Min      1Q  Median      3Q     Max 
-64.375 -23.375  -1.616  20.625 110.625 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   69.0106     0.9552  72.250   <2e-16 ***
type_classificationbalanced    0.3641     1.2885   0.283   0.7776    
type_classificationdefensive  -2.9691     1.3759  -2.158   0.0312 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.16 on 946 degrees of freedom
Multiple R-squared:  0.00557,   Adjusted R-squared:  0.003467 
F-statistic: 2.649 on 2 and 946 DF,  p-value: 0.07124
## Calculating offensive estimated coefficient and p-value
V <- vcov(fit_classification)
coef_all <- coef(fit_classification)
coef_factors <- coef_all[-1]  # Remove intercept

k <- length(coef_factors)
n <- nrow(pokemon)
p <- length(coef_all)
df <- n - p

beta_missing <- -sum(coef_factors) # type_1=water estimated coefficient

V_factors <- V[-1, -1]
c_contrast <- rep(1, k)
var_beta_missing <- as.numeric(t(c_contrast) %*% V_factors %*% c_contrast) # calculate variance of offensive

sd_beta_missing <- sqrt(var_beta_missing) # calculate std deviation of offensive

t_stat <- beta_missing / sd_beta_missing # calculate t-stat

p_value <- 2 * (1 - pt(abs(t_stat), df = df)) # calculate p-value

print(beta_missing)
[1] 2.605041
print(p_value)
[1] 0.06047163
fit_null <- lm(speed ~ 1, data = pokemon)
anova(fit_null, fit_classification)
Analysis of Variance Table

Model 1: speed ~ 1
Model 2: speed ~ type_classification
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1    948 808815                              
2    946 804311  2    4504.7 2.6491 0.07124 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on one-way ANOVA test, the model fails to reject \(H_0\) (p-value = 0.07124 \(>\) 0.05), and it is concluded that the model is not statistically significant. However, the regression summary shows that the defensive type showed a statistically significant difference from the grand mean (refer to the regression summary table above), with a mean speed approximately 2.9691 points lower (p = 0.0312 \(<\) 0.05). Hence, despite this individual significance, the model explains very little variance in speed (\(R^2\) = 0.0056), and the overall factor was marginally insignificant.

3 Speed points variation in association with Pokémon’s both types after controlling for the weight

3.1 Exploratory Data Analysis

ggplot(pokemon, aes(x = weight, y = speed, group = type_1, color = type_1)) +
  scale_color_viridis_d() +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Speed vs Weight by Pokémon Type", x = "Weight", y = "Speed") +
  theme_bw()

Speed vs Weight by Pokémon Type plot shows that the relationship between speed points and weight differs considerably depending on the primary type, signalling the importance of the interaction term between weight and type_1 in the regression model later.

3.2 ANCOVA Test

As it is assumed that the NA values in the type_2 predictor represents for that Pokémon instead of missing data, I first recode NA to and treat it as another level for Pokémon without a secondary type.

For the ANCOVA test on speed points variation with both Pokémon type after controlling for their weight, the regression model being tested is lm(speed ~ weight + type_1 + type_2_clean), where type_2_clean is the pre-processed type_2 column. I use centered parametrisation for type_1 and referenced parametrisation for type_2:

\(H_0: \beta_2 = ... = \beta_{37} = 0\)

\(H_1: \text{at least one of } \beta_2, ..., \beta_{37} \text{ is not 0}\)

pokemon$type_2_clean <- pokemon$type_2
pokemon$type_2_clean <- as.character(pokemon$type_2_clean)  # Converting to character
pokemon$type_2_clean[is.na(pokemon$type_2_clean)] <- "None"  # Replacing NA with "None"
pokemon$type_2_clean <- factor(pokemon$type_2_clean)  # Converting to factor

fit_null <- lm(speed ~ weight, data = pokemon) # Fitting model with only weight as the predictor
summary(fit_null)

Call:
lm(formula = speed ~ weight, data = pokemon)

Residuals:
    Min      1Q  Median      3Q     Max 
-64.338 -23.538  -3.502  21.466 111.021 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 68.485731   1.083139  63.229   <2e-16 ***
weight       0.008117   0.007909   1.026    0.305    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.21 on 947 degrees of freedom
Multiple R-squared:  0.001111,  Adjusted R-squared:  5.634e-05 
F-statistic: 1.053 on 1 and 947 DF,  p-value: 0.305
fit_ancova <- lm(speed ~ weight + type_1 + type_2_clean, data = pokemon) # Fitting model with weight, type_1, and type_2_clean as the predictors
summary(fit_ancova)

Call:
lm(formula = speed ~ weight + type_1 + type_2_clean, data = pokemon)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.109 -19.703  -2.529  17.279 103.253 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           55.629995  12.170735   4.571 5.53e-06 ***
weight                 0.014057   0.008177   1.719 0.085940 .  
type_1bug             -7.292930   3.278963  -2.224 0.026382 *  
type_1dark             3.036114   4.476398   0.678 0.497786    
type_1dragon          10.968005   4.382545   2.503 0.012501 *  
type_1electric        18.202351   3.514700   5.179 2.75e-07 ***
type_1fairy          -14.686879   5.996420  -2.449 0.014502 *  
type_1fighting        -0.762774   4.755946  -0.160 0.872615    
type_1fire             2.075252   3.558344   0.583 0.559898    
type_1flying          34.409277  13.018531   2.643 0.008356 ** 
type_1ghost           -3.728718   4.490821  -0.830 0.406588    
type_1grass          -11.335210   3.131451  -3.620 0.000311 ***
type_1ground          -5.202764   4.493541  -1.158 0.247236    
type_1ice             -2.911835   4.922119  -0.592 0.554277    
type_1normal          -2.122462   2.756726  -0.770 0.441546    
type_1poison          -7.076699   4.585233  -1.543 0.123089    
type_1psychic         10.136221   3.442677   2.944 0.003319 ** 
type_1rock            -7.650909   3.494827  -2.189 0.028834 *  
type_1steel          -13.914314   5.205136  -2.673 0.007648 ** 
type_2_cleandark      23.196028  13.221332   1.754 0.079691 .  
type_2_cleandragon    13.256355  13.587382   0.976 0.329503    
type_2_cleanelectric   2.212474  14.406771   0.154 0.877981    
type_2_cleanfairy     16.191321  12.910102   1.254 0.210106    
type_2_cleanfighting  30.387917  13.029971   2.332 0.019909 *  
type_2_cleanfire      27.531347  14.047287   1.960 0.050311 .  
type_2_cleanflying    30.984206  12.362657   2.506 0.012374 *  
type_2_cleanghost      6.660059  13.772932   0.484 0.628813    
type_2_cleangrass      3.783906  13.369221   0.283 0.777217    
type_2_cleanground     0.315651  12.840056   0.025 0.980393    
type_2_cleanice       14.692071  14.014330   1.048 0.294750    
type_2_cleanNone      10.125856  12.209878   0.829 0.407142    
type_2_cleannormal    20.046239  16.024419   1.251 0.211262    
type_2_cleanpoison    19.813553  13.078136   1.515 0.130115    
type_2_cleanpsychic   16.699176  12.978336   1.287 0.198527    
type_2_cleanrock     -14.986723  14.179230  -1.057 0.290815    
type_2_cleansteel     10.529991  13.161491   0.800 0.423884    
type_2_cleanwater      3.662208  13.640022   0.268 0.788383    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.83 on 912 degrees of freedom
Multiple R-squared:  0.1885,    Adjusted R-squared:  0.1564 
F-statistic: 5.884 on 36 and 912 DF,  p-value: < 2.2e-16
anova(fit_null, fit_ancova) # Test whether type_1 and type_2 have effect on speed after controlling for the weight
Analysis of Variance Table

Model 1: speed ~ weight
Model 2: speed ~ weight + type_1 + type_2_clean
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    947 807917                                  
2    912 656371 35    151546 6.0162 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANCOVA test gives a p-value of < 2.2e-16. Hence, \(H_0\) is rejected, and it is concluded that speed points vary with both Pokemon type after controlling for their weight (or both Pokemon types are statistically significant in predicting speed points after controlling for their weight).

Another ANCOVA test is employed to check whether there is any difference in the relationship between speed and weight depending on the primary type. The regression model being tested here is lm(formula = speed ~ weight * type_1 + type_2_clean), using centered parametrisation for type_1 and referenced parametrisation for type_2:

\(H_0: \beta_2 = ... = \beta_{55} = 0\)

\(H_1: \text{at least one of } \beta_2, ..., \beta_{55} \text{ is not 0}\)

fit_interaction <- lm(speed ~ weight * type_1 + type_2_clean, data = pokemon) # Fitting a model with weight-type_1 interaction term
summary(fit_interaction)

Call:
lm(formula = speed ~ weight * type_1 + type_2_clean, data = pokemon)

Residuals:
    Min      1Q  Median      3Q     Max 
-62.778 -19.529  -1.978  16.658 103.161 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            53.06652   12.22736   4.340 1.59e-05 ***
weight                  0.09606    0.03031   3.169 0.001583 ** 
type_1bug              -6.19180    4.08167  -1.517 0.129625    
type_1dark              9.31939    5.10690   1.825 0.068355 .  
type_1dragon            8.71837    6.12690   1.423 0.155095    
type_1electric         20.88453    4.48240   4.659 3.66e-06 ***
type_1fairy           -16.07347    6.73832  -2.385 0.017269 *  
type_1fighting         10.46493    6.94550   1.507 0.132235    
type_1fire              7.50378    4.81489   1.558 0.119480    
type_1flying          -14.54622   27.79963  -0.523 0.600928    
type_1ghost            -0.38190    5.12564  -0.075 0.940623    
type_1grass            -7.98541    3.95888  -2.017 0.043985 *  
type_1ground            0.01164    5.33328   0.002 0.998260    
type_1ice               1.44932    6.39139   0.227 0.820662    
type_1normal           -0.34238    3.53779  -0.097 0.922924    
type_1poison          -16.53370    7.38157  -2.240 0.025345 *  
type_1psychic          15.02223    4.03159   3.726 0.000207 ***
type_1rock              0.36464    4.84026   0.075 0.939965    
type_1steel           -11.30966    6.52238  -1.734 0.083266 .  
type_2_cleandark       22.68898   13.25687   1.711 0.087337 .  
type_2_cleandragon     13.01196   13.79004   0.944 0.345641    
type_2_cleanelectric    1.14413   14.42576   0.079 0.936802    
type_2_cleanfairy      15.45560   12.89118   1.199 0.230873    
type_2_cleanfighting   28.88244   13.03448   2.216 0.026953 *  
type_2_cleanfire       26.28351   14.05662   1.870 0.061834 .  
type_2_cleanflying     29.39924   12.33921   2.383 0.017399 *  
type_2_cleanghost       5.73528   13.77826   0.416 0.677323    
type_2_cleangrass       3.09080   13.35024   0.232 0.816966    
type_2_cleanground     -0.80176   12.86306  -0.062 0.950313    
type_2_cleanice        12.56806   14.09248   0.892 0.372725    
type_2_cleanNone        9.15629   12.19483   0.751 0.452950    
type_2_cleannormal     18.57613   15.99180   1.162 0.245706    
type_2_cleanpoison     18.77286   13.06230   1.437 0.151017    
type_2_cleanpsychic    15.63637   12.97964   1.205 0.228644    
type_2_cleanrock      -16.37200   14.16378  -1.156 0.248028    
type_2_cleansteel       9.37921   13.19393   0.711 0.477348    
type_2_cleanwater       3.58669   13.57942   0.264 0.791743    
weight:type_1bug       -0.01035    0.06207  -0.167 0.867634    
weight:type_1dark      -0.12119    0.04225  -2.869 0.004221 ** 
weight:type_1dragon    -0.03879    0.04221  -0.919 0.358350    
weight:type_1electric  -0.04950    0.08721  -0.568 0.570469    
weight:type_1fairy      0.14088    0.12482   1.129 0.259350    
weight:type_1fighting  -0.21690    0.09045  -2.398 0.016691 *  
weight:type_1fire      -0.10781    0.05081  -2.122 0.034130 *  
weight:type_1flying     0.87007    0.44485   1.956 0.050789 .  
weight:type_1ghost     -0.08018    0.04047  -1.981 0.047861 *  
weight:type_1grass     -0.07563    0.05160  -1.466 0.143131    
weight:type_1ground    -0.09305    0.03464  -2.686 0.007356 ** 
weight:type_1ice       -0.09068    0.04898  -1.851 0.064452 .  
weight:type_1normal    -0.03990    0.04577  -0.872 0.383586    
weight:type_1poison     0.27964    0.15792   1.771 0.076936 .  
weight:type_1psychic   -0.10274    0.03767  -2.727 0.006510 ** 
weight:type_1rock      -0.13567    0.04611  -2.943 0.003338 ** 
weight:type_1steel     -0.07761    0.03449  -2.250 0.024699 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.7 on 895 degrees of freedom
Multiple R-squared:  0.2113,    Adjusted R-squared:  0.1646 
F-statistic: 4.524 on 53 and 895 DF,  p-value: < 2.2e-16
anova(fit_ancova, fit_interaction) # Test the significance of weight-type_1 interaction term
Analysis of Variance Table

Model 1: speed ~ weight + type_1 + type_2_clean
Model 2: speed ~ weight * type_1 + type_2_clean
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1    912 656371                              
2    895 637904 17     18467 1.5241 0.07904 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANCOVA test gives a p-value of 0.07904 (> 0.05). Hence, \(H_0\) is accepted, and it is concluded that the relationship between speed and weight does not differ depending on the primary type. However, this ANCOVA test result is contradictory to the interpretation of Speed vs Weight by Pokémon Type plot. A closer look at the speed-weight data of each type reveals that high leverage points are present in the dataset (Speed vs Weight by Pokémon Type (Bug, Fighting, Grass, Normal) plot), creating regression lines that can’t be interpreted as such without any statistical test.

selected_types <- subset(pokemon, type_1 %in% c("bug", "fighting", "grass", "normal"))

ggplot(selected_types, aes(x = weight, y = speed)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Speed vs Weight by Pokémon Type (Bug, Fighting, Grass, Normal)", x = "Weight", y = "Speed") +
  theme_bw() +
  facet_wrap(~type_1, ncol = 2)

4 Conclusion

ANOVA test on the regression model of speed points against type_1 shows that the primary type of Pokémon is statistically significant in predicting the speed. Furthermore, the regression reveals that bug, dragon, electric, fairy, flying, grass, psychic, and steel types are statistically different from the overall speed.

By constructing linear combinations of the \(\mu_i\)’s, we’re able to see the effect of the number of legendary Pokémon on the speed points and the effect of the Pokémon’s combat style on the speed points. The ANOVA test on the first case shows that the model is statistically significant. Additionally, both legendary and non_legendary levels’ speed average differ from the overall average based on the regression model.

On the other hand, the ANOVA test on the latter case shows that the model is not statistically significant. However, the regression model demonstrates that defensive-type Pokémon’s average speed differs significantly from the overall average speed, while balanced and offensive types do not.

The ANCOVA test on speed variation with both Pokémon types after controlling for their weight concludes that the model without the interaction term between the primary type and weight is statistically significant.

5 Disclaimer

This report has been partially submitted as a part of STAT8130 Generalised Linear Modelling course in Semester 1 2026 in obtaining a degree in Master’s of Applied Data Analytics.