# Read the dataset
pokemon_data <- read.csv("PokemonStats.csv")

We take the `Total` column (the “Total Stat”) as the response variable and predict it from four of the individual stats: SpAtk, Attack, SpDef, and HP.

# Build the linear regression model
linear_model <- lm(Total ~ SpAtk + Attack + SpDef + HP, data=pokemon_data)

# Display the summary of the model
summary(linear_model)
## 
## Call:
## lm(formula = Total ~ SpAtk + Attack + SpDef + HP, data = pokemon_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -119.776  -20.522   -0.808   16.921  128.459 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 44.94126    3.12578   14.38   <2e-16 ***
## SpAtk        1.17452    0.03235   36.31   <2e-16 ***
## Attack       1.63702    0.03139   52.16   <2e-16 ***
## SpDef        1.56948    0.03830   40.98   <2e-16 ***
## HP           0.91633    0.03881   23.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.18 on 1189 degrees of freedom
## Multiple R-squared:  0.938,  Adjusted R-squared:  0.9378 
## F-statistic:  4497 on 4 and 1189 DF,  p-value: < 2.2e-16
# Heteroskedasticity-robust (HC1) standard errors
# coeftest() comes from lmtest and vcovHC() from sandwich
library(lmtest)
library(sandwich)
coeftest(linear_model, vcov = vcovHC(linear_model, type="HC1"))
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 44.941257   3.501719  12.834 < 2.2e-16 ***
## SpAtk        1.174520   0.037759  31.106 < 2.2e-16 ***
## Attack       1.637019   0.035666  45.899 < 2.2e-16 ***
## SpDef        1.569484   0.049459  31.733 < 2.2e-16 ***
## HP           0.916333   0.052079  17.595 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1. R-squared: 0.938

  • Approximately 93.8% of the variation in the “Total Stat” is explained by our model (adjusted R-squared: 0.9378). This is a high value, indicating that the model fits the data well.

2. Coefficients:

  1. Intercept: 44.9413 - This is the estimated “Total Stat” when all predictors are zero.
  2. SpAtk: 1.1745 - For every unit increase in SpAtk, the “Total Stat” is expected to increase by 1.1745 units, keeping the other predictors constant.
  3. Attack: 1.6370 - For every unit increase in Attack, the “Total Stat” is expected to increase by 1.6370 units, keeping the other predictors constant.
  4. SpDef: 1.5695 - For every unit increase in SpDef, the “Total Stat” is expected to increase by 1.5695 units, keeping the other predictors constant.
  5. HP: 0.9163 - For every unit increase in HP, the “Total Stat” is expected to increase by 0.9163 units, keeping the other predictors constant.

3. Significance:

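  • All four predictors are statistically significant (p < 2e-16 for each). The conclusion is unchanged under the heteroskedasticity-robust (HC1) standard errors, which are somewhat larger than the ordinary ones.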
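
The significance statement can be made more concrete with 95% confidence intervals. The sketch below recomputes them from the rounded estimates and standard errors printed in the two coefficient tables above, so the last digit may differ slightly from what `confint(linear_model)` would report. The helper name `wald_ci` is ours and is not part of any package.

```r
# Estimates and standard errors copied from the two coefficient tables above.
est    <- c(Intercept = 44.94126, SpAtk = 1.17452, Attack = 1.63702,
            SpDef = 1.56948, HP = 0.91633)
se_ols <- c(3.12578, 0.03235, 0.03139, 0.03830, 0.03881)      # summary(linear_model)
se_hc1 <- c(3.501719, 0.037759, 0.035666, 0.049459, 0.052079) # coeftest(..., HC1)

# Two-sided 95% critical value using the model's 1189 residual degrees of freedom.
crit <- qt(0.975, df = 1189)

# Hypothetical helper: Wald interval estimate +/- critical value * standard error.
wald_ci <- function(est, se, crit) {
  cbind(lower = est - crit * se, upper = est + crit * se)
}

ci_ols <- wald_ci(est, se_ols, crit)
ci_hc1 <- wald_ci(est, se_hc1, crit)

print(round(ci_ols, 3))
print(round(ci_hc1, 3))
```

None of the intervals should include zero, and the HC1 intervals should be wider than the ordinary ones because every robust standard error is larger.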

We’ll now diagnose the model using residual plots and a QQ plot.

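# Plotting packages: ggplot2 for the plots, gridExtra for grid.arrange()
library(ggplot2)
library(gridExtra)

# 1. Residuals vs. Fitted Values Plot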
fitted_values <- fitted(linear_model)
resid_values <- residuals(linear_model)

plot1 <- ggplot(data.frame(fitted=fitted_values, residuals=resid_values), aes(x=fitted, y=residuals)) +
  geom_point(aes(color=fitted), alpha=0.5) + 
  scale_color_gradient(low="blue", high="cyan") +
  geom_smooth(aes(group=1), method="loess", col="gold") +  
  theme_minimal() +
  theme(legend.position="none") + 
  ggtitle('Residuals vs Fitted') +
  xlab('Fitted values') +
  ylab('Residuals')

# 2. QQ Plot
plot2 <- ggplot(data.frame(residuals=resid_values), aes(sample=residuals)) +
  stat_qq(color="lightgreen", distribution=qnorm, dparams=list(mean=mean(resid_values), sd=sd(resid_values))) +  
  geom_abline(intercept = 0, slope = 1, color = "purple") +  
  theme_minimal() +
  ggtitle('Normal Q-Q')

# Display plots 
grid.arrange(plot1, plot2, ncol=2)
## `geom_smooth()` using formula = 'y ~ x'

Overview of the plots:

1. Residuals vs. Fitted Values Plot:

  • The purpose of this plot is to check the assumption of linearity and equal variance (homoscedasticity).
  • Ideally, the residuals should be scattered randomly around zero, and there should be no apparent pattern.
  • From our plot, the residuals seem to hover around the zero line for the most part, indicating that linearity is reasonably met. However, there are some deviations, especially for higher fitted values, suggesting potential non-linearity or heteroscedasticity.

2. Q-Q Plot:

  • This plot is used to check the assumption of normally distributed residuals.
  • If the residuals are normally distributed, they should roughly lie on the diagonal purple line.
  • Residuals closely follow the purple line, suggesting that they’re approximately normally distributed.

Both plots seem decent, although there’s some minor deviation in the Residuals vs. Fitted Values Plot.
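
A visual impression of unequal variance can be backed up with a formal test. The following is a minimal sketch of the studentized (Koenker) Breusch-Pagan test written with base R only. The function name `bp_by_hand` is ours, and it uses the model's own predictors as the variance regressors.

```r
# Studentized (Koenker) Breusch-Pagan test: regress the squared residuals on
# the model's predictors and use n * R^2 of that auxiliary regression.
bp_by_hand <- function(model) {
  X   <- model.matrix(model)      # design matrix, including the intercept column
  e2  <- residuals(model)^2       # squared residuals
  aux <- lm.fit(X, e2)            # auxiliary regression
  rss <- sum(aux$residuals^2)
  tss <- sum((e2 - mean(e2))^2)
  n    <- length(e2)
  stat <- n * (1 - rss / tss)     # n * R^2
  df   <- ncol(X) - 1             # number of predictors (intercept excluded)
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Usage with the model from this report:
# bp_by_hand(linear_model)
```

A small p-value would point to heteroscedasticity, in which case the HC1 robust standard errors are the ones to rely on. The `bptest()` function in the lmtest package offers a packaged version of this test.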

Potential Issues with the Model:

1. Linearity & Homoscedasticity:

  • Observation: While the residuals largely hover around the zero line in the Residuals vs. Fitted Values plot, there are deviations, especially for higher fitted values. This could suggest potential non-linearity or heteroscedasticity in the model.
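  • Implication: The two problems have different consequences. Non-linearity can bias the coefficient estimates. Heteroscedasticity leaves the OLS estimates unbiased but makes the usual standard errors unreliable, which is why we also report HC1 robust standard errors.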
  • Recommendation: We can consider exploring non-linear transformations or adding interaction terms to better capture the relationship.

2. Normal Distribution of Residuals:

  • Observation: The residuals appear to be approximately normally distributed, as observed from the Q-Q plot. Normality matters most for exact t-tests and confidence intervals in small samples; with 1,194 observations, inference on the coefficients is less sensitive to mild departures.
  • Implication: Currently, this assumption seems to be met, which is a positive aspect of our model.

3. Outliers and Leverage Points:

  • Observation: Outliers can unduly influence the model’s fit, leading to biased estimates. We haven’t specifically addressed this in our initial analysis.
  • Recommendation: We can investigate potential outliers or high-leverage points using Cook’s distance or leverage vs. residual squared plots. Addressing these points can improve the model’s robustness.

4. Model Fit:

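  • Observation: Our model has a high \(R^2\) value, suggesting a good fit. A high \(R^2\) can sometimes indicate overfitting, but with only four predictors, 1,194 observations, and an adjusted \(R^2\) (0.9378) almost identical to \(R^2\) (0.938), that risk looks small here. Note also that if Total is the sum of a Pokémon's individual stats, a high \(R^2\) is expected, because the predictors are components of the response.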
  • Recommendation: We can consider using techniques like cross-validation to ensure that the model generalizes well to new data.
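
The cross-validation recommendation can be sketched in a few lines of base R. The function name `cv_rmse`, the choice of five folds, and the seed are our assumptions, and the sketch assumes the data contain no missing values in the modelled columns.

```r
# k-fold cross-validated RMSE for a linear model (illustrative helper).
cv_rmse <- function(formula, data, k = 5, seed = 1) {
  set.seed(seed)
  # Randomly assign each row to one of k folds of near-equal size.
  folds    <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]   # name of the left-hand-side variable
  sq_err   <- numeric(0)
  for (i in seq_len(k)) {
    train <- data[folds != i, , drop = FALSE]
    test  <- data[folds == i, , drop = FALSE]
    fit   <- lm(formula, data = train)          # fit on k - 1 folds
    pred  <- predict(fit, newdata = test)       # predict the held-out fold
    sq_err <- c(sq_err, (test[[response]] - pred)^2)
  }
  sqrt(mean(sq_err))
}

# Usage with the data from this report:
# cv_rmse(Total ~ SpAtk + Attack + SpDef + HP, pokemon_data)
```

If the cross-validated RMSE is close to the in-sample residual standard error of 30.18, the model generalizes about as well as it fits.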

Coefficient Interpretation:

1. Intercept: 44.9413

  • The estimated “Total Stat” when all predictors (SpAtk, Attack, SpDef, HP) are zero. Given the nature of the data, this value is mainly theoretical, as it’s not realistic for these stats to be zero.

2. SpAtk: 1.1745

  • For every unit increase in the SpAtk value, the “Total Stat” is expected to increase by approximately 1.1745 units, keeping other variables constant.
  • This suggests that SpAtk is a significant contributor to a Pokémon’s total stats.

3. Attack: 1.6370

  • For every unit increase in the Attack value, the “Total Stat” is expected to increase by approximately 1.6370 units.
  • Among the predictors we’ve used, Attack has the highest coefficient, indicating its strong influence on the total stats.

4. SpDef: 1.5695

  • For every unit increase in the SpDef value, the “Total Stat” is expected to increase by approximately 1.5695 units.
  • SpDef also has a substantial impact on the total stats, nearly as much as Attack.

5. HP: 0.9163

  • For every unit increase in HP, the “Total Stat” is expected to increase by approximately 0.9163 units.
  • While HP has the lowest coefficient among our predictors, it still plays an essential role in determining the total stats.
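
The "one unit increase" reading of a coefficient can be checked directly: raise one predictor by 1, hold the others fixed, and compare predictions. The helper `unit_effect` below is a hypothetical name introduced for this illustration.

```r
# Change in the predicted response when `var` increases by one unit
# and every other predictor keeps its value.
unit_effect <- function(model, newdata, var) {
  bumped <- newdata
  bumped[[var]] <- bumped[[var]] + 1
  unname(predict(model, newdata = bumped) - predict(model, newdata = newdata))
}

# Usage with the model from this report:
# unit_effect(linear_model, pokemon_data[1, ], "Attack")
```

For a model without interactions or transformations, the result equals the coefficient of that predictor regardless of which row is used, so the Attack call should return about 1.637.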

Insights:

  • Both Attack and SpDef have strong positive relationships with the “Total Stat”, with Attack being the most influential.
  • A Pokémon’s special attack (SpAtk) and hit points (HP) also positively influence its total stats, but with smaller coefficients (1.17 and 0.92) than Attack (1.64) and SpDef (1.57).
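
Raw coefficients measure the effect per stat point. To compare influence in a way that also accounts for how much each stat varies across Pokémon, one option is to look at standardized coefficients. This is a sketch under the assumption that all predictors are numeric and the intercept is the first coefficient; the name `standardized_coefs` is ours.

```r
# Standardized (beta) coefficients: slope * sd(predictor) / sd(response).
standardized_coefs <- function(model) {
  mf <- model.frame(model)
  y  <- model.response(mf)
  X  <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  coef(model)[-1] * apply(X, 2, sd) / sd(y)
}

# Usage with the model from this report:
# sort(standardized_coefs(linear_model), decreasing = TRUE)
```

Each value is the expected change in Total, in standard deviations, for a one standard deviation increase in that stat.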

Significance:

  • Understanding which stats have the most influence on a Pokémon’s “Total Stat” can be crucial for players aiming to build a strong team. For instance, prioritizing Pokémon with high Attack and SpDef can be a strategy for those wanting to maximize their team’s overall strength.

Further Questions:

  1. Are there interactions between these stats that further amplify the total stat value?
  2. How do other variables, like Type1 or Type2, influence the total stats? Could including them improve the model’s predictive power or offer more nuanced insights?
  3. Are there specific Pokémon types that tend to have higher Attack or SpDef values on average?
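
The first two questions can be approached by adding a term to the model and comparing the nested fits with an F-test. The helper `test_added_term` below is a hypothetical name, and which term to add (for example `Attack:SpDef` or `Type1`) is an assumption about what is worth testing, not a result.

```r
# Refit the model with one extra term and compare the two nested fits.
test_added_term <- function(model, term) {
  bigger <- update(model, as.formula(paste(". ~ . +", term)))
  anova(model, bigger)   # F-test: does the extra term reduce the residual SS?
}

# Usage with the model from this report:
# test_added_term(linear_model, "Attack:SpDef")  # interaction between two stats
# test_added_term(linear_model, "Type1")         # assumes a Type1 column with no missing values
```

Both fits must use the same rows, so a term with missing values would need to be handled before the comparison.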