# Read the dataset
pokemon_data <- read.csv("PokemonStats.csv")
We use “Total Stat” (the Total column) as the response variable and
predict it from four other stats: SpAtk, Attack, SpDef, and HP.
# Build the linear regression model
linear_model <- lm(Total ~ SpAtk + Attack + SpDef + HP, data=pokemon_data)
# Display the summary of the model
summary(linear_model)
##
## Call:
## lm(formula = Total ~ SpAtk + Attack + SpDef + HP, data = pokemon_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.776 -20.522 -0.808 16.921 128.459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.94126 3.12578 14.38 <2e-16 ***
## SpAtk 1.17452 0.03235 36.31 <2e-16 ***
## Attack 1.63702 0.03139 52.16 <2e-16 ***
## SpDef 1.56948 0.03830 40.98 <2e-16 ***
## HP 0.91633 0.03881 23.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.18 on 1189 degrees of freedom
## Multiple R-squared: 0.938, Adjusted R-squared: 0.9378
## F-statistic: 4497 on 4 and 1189 DF, p-value: < 2.2e-16
# For heteroskedasticity-robust standard errors
library(lmtest)    # provides coeftest()
library(sandwich)  # provides vcovHC()
coeftest(linear_model, vcov = vcovHC(linear_model, type="HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.941257 3.501719 12.834 < 2.2e-16 ***
## SpAtk 1.174520 0.037759 31.106 < 2.2e-16 ***
## Attack 1.637019 0.035666 45.899 < 2.2e-16 ***
## SpDef 1.569484 0.049459 31.733 < 2.2e-16 ***
## HP 0.916333 0.052079 17.595 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
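To make the comparison between the two coefficient tables concrete, the sketch below copies the reported estimates and standard errors and computes the ratio of robust to classical standard errors, along with robust 95% confidence intervals. The variable names (`classical_se`, `robust_se`, `se_ratio`, `robust_ci`) are illustrative, and the intervals assume a t distribution with the 1189 residual degrees of freedom shown in the summary.

```r
# Estimates and standard errors copied from the two outputs above
estimate <- c("(Intercept)" = 44.941257, SpAtk = 1.174520, Attack = 1.637019,
              SpDef = 1.569484, HP = 0.916333)
classical_se <- c("(Intercept)" = 3.12578, SpAtk = 0.03235, Attack = 0.03139,
                  SpDef = 0.03830, HP = 0.03881)
robust_se <- c("(Intercept)" = 3.501719, SpAtk = 0.037759, Attack = 0.035666,
               SpDef = 0.049459, HP = 0.052079)

# Ratios above 1 mean the classical standard errors were too optimistic
se_ratio <- robust_se / classical_se
print(round(se_ratio, 2))

# Robust 95% confidence intervals (t quantile, 1189 residual df)
t_crit <- qt(0.975, df = 1189)
robust_ci <- cbind(lower = estimate - t_crit * robust_se,
                   upper = estimate + t_crit * robust_se)
print(round(robust_ci, 3))
```

Every ratio is above 1, with HP showing the largest increase, yet all intervals stay well away from zero, so the conclusions about significance do not change.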
1. R-squared: 0.938
- This means that approximately 93.8% of the variation in the
“Total Stat” is explained by our model (adjusted R-squared: 0.9378).
This is a high value, indicating that our model fits the data well.
2. Coefficients:
- Intercept: 44.9413 - This is the estimated “Total Stat” when
all predictors are zero.
- SpAtk: 1.1745 - For every unit increase in SpAtk, the “Total Stat”
is expected to increase by 1.1745 units, keeping other variables
constant.
- Attack: 1.6370 - For every unit increase in Attack, the “Total Stat”
is expected to increase by 1.6370 units, keeping other variables
constant.
- SpDef: 1.5695 - For every unit increase in SpDef, the “Total Stat”
is expected to increase by 1.5695 units, keeping other variables
constant.
- HP: 0.9163 - For every unit increase in HP, the “Total Stat” is
expected to increase by 0.9163 units, keeping other variables constant.
3. Significance:
- All predictors are statistically significant (every p-value is below
2e-16), and they remain so under the heteroskedasticity-robust standard
errors.
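As a worked example of how these coefficients combine into a prediction, the sketch below plugs a hypothetical Pokémon into the fitted equation. The coefficients are copied from the summary output; the stat values in `mon_stats` are made up for illustration and do not come from the dataset.

```r
# Coefficients copied from the summary() output above
coefs <- c("(Intercept)" = 44.94126, SpAtk = 1.17452, Attack = 1.63702,
           SpDef = 1.56948, HP = 0.91633)

# Hypothetical Pokémon (illustrative values, not taken from the dataset)
mon_stats <- c(SpAtk = 80, Attack = 100, SpDef = 70, HP = 60)

# Fitted value = intercept + sum over predictors of coefficient * stat
predicted_total <- coefs[["(Intercept)"]] + sum(coefs[names(mon_stats)] * mon_stats)
print(predicted_total)
```

For these values the fitted Total is about 467.4. With the model object available, `predict(linear_model, newdata = ...)` performs the same calculation.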
We’ll now diagnose the model using a residuals vs. fitted values plot
and a QQ plot.
library(ggplot2)    # plotting
library(gridExtra)  # provides grid.arrange()
# 1. Residuals vs. Fitted Values Plot
fitted_values <- fitted(linear_model)
resid_values <- residuals(linear_model)
plot1 <- ggplot(data.frame(fitted=fitted_values, residuals=resid_values), aes(x=fitted, y=residuals)) +
geom_point(aes(color=fitted), alpha=0.5) +
scale_color_gradient(low="blue", high="cyan") +
geom_smooth(aes(group=1), method="loess", col="gold") +
theme_minimal() +
theme(legend.position="none") +
ggtitle('Residuals vs Fitted') +
xlab('Fitted values') +
ylab('Residuals')
# 2. QQ Plot
plot2 <- ggplot(data.frame(residuals=resid_values), aes(sample=residuals)) +
stat_qq(color="lightgreen", distribution=qnorm, dparams=list(mean=mean(resid_values), sd=sd(resid_values))) +
geom_abline(intercept = 0, slope = 1, color = "purple") +
theme_minimal() +
ggtitle('Normal Q-Q')
# Display plots
grid.arrange(plot1, plot2, ncol=2)
## `geom_smooth()` using formula = 'y ~ x'

Overview of the plots:
1. Residuals vs. Fitted Values Plot:
- The purpose of this plot is to check the assumption of linearity and
equal variance (homoscedasticity).
- Ideally, the residuals should be scattered randomly around zero, and
there should be no apparent pattern.
- From our plot, the residuals seem to hover around the zero line for
the most part, indicating that linearity is reasonably met. However,
there are some deviations, especially for higher fitted values,
suggesting potential non-linearity or heteroscedasticity.
2. Q-Q Plot:
- This plot is used to check the assumption of normally distributed
residuals.
- If the residuals are normally distributed, they should roughly lie
on the diagonal purple line.
- Residuals closely follow the purple line, suggesting that they’re
approximately normally distributed.
Both plots seem decent, although there’s some minor deviation in the
Residuals vs. Fitted Values Plot.
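The visual checks above can be complemented with formal tests. The sketch below computes a studentized Breusch-Pagan statistic by hand (which should match the default of `lmtest::bptest()`) and runs a Shapiro-Wilk test on the residuals. Because the CSV is not available here, it uses a synthetic data frame `sim` as a stand-in for `pokemon_data`; the simulated coefficients and noise are assumptions chosen only to make the example run.

```r
# Synthetic stand-in for pokemon_data so the sketch runs on its own (assumption);
# the noise deliberately grows with Attack to mimic heteroscedasticity.
set.seed(42)
n <- 500
sim <- data.frame(
  SpAtk  = runif(n, 20, 150),
  Attack = runif(n, 20, 150),
  SpDef  = runif(n, 20, 150),
  HP     = runif(n, 20, 150)
)
sim$Total <- 45 + 1.2 * sim$SpAtk + 1.6 * sim$Attack + 1.6 * sim$SpDef +
  0.9 * sim$HP + rnorm(n, sd = 5 + 0.3 * sim$Attack)

fit <- lm(Total ~ SpAtk + Attack + SpDef + HP, data = sim)

# Studentized Breusch-Pagan test: regress the squared residuals on the
# predictors; n * R^2 of that auxiliary regression is compared to chi-squared.
sim$sq_resid <- resid(fit)^2
aux <- lm(sq_resid ~ SpAtk + Attack + SpDef + HP, data = sim)
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# Shapiro-Wilk normality test on the residuals (accepts 3 to 5000 observations)
sw <- shapiro.test(resid(fit))

cat("Breusch-Pagan statistic:", round(bp_stat, 2), "p-value:", signif(bp_p, 3), "\n")
cat("Shapiro-Wilk W:", round(unname(sw$statistic), 3), "p-value:", signif(sw$p.value, 3), "\n")
```

A small Breusch-Pagan p-value points to non-constant variance; on the real data, replace `sim` with `pokemon_data` and `fit` with `linear_model`.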
Potential Issues with the Model:
1. Linearity & Homoscedasticity:
- Observation: While the residuals largely hover
around the zero line in the Residuals vs. Fitted Values plot, there are
deviations, especially for higher fitted values. This could suggest
potential non-linearity or heteroscedasticity in the model.
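- Implication: Non-linearity means the model is
misspecified, so the coefficient estimates can be biased.
Heteroscedasticity alone leaves the estimates unbiased but makes the
usual standard errors unreliable, which is why we also report the HC1
robust standard errors above.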
- Recommendation: We can consider exploring
non-linear transformations or adding interaction terms to better capture
the relationship.
2. Normal Distribution of Residuals:
- Observation: The residuals appear to be
approximately normally distributed, as observed from the QQ plot. This
assumption matters for hypothesis testing and for constructing
confidence intervals, most of all in small samples; with 1,194
observations, inference is fairly robust to mild departures.
- Implication: Currently, this assumption seems to be
met, which is a positive aspect of our model.
3. Outliers and Leverage Points:
- Observation: Outliers can unduly influence the
model’s fit, leading to biased estimates. We haven’t specifically
addressed this in our initial analysis.
- Recommendation: We can investigate potential
outliers or high-leverage points using Cook’s distance or leverage
vs. residual squared plots. Addressing these points can improve the
model’s robustness.
4. Model Fit:
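- Observation: Our model has a high \(R^2\) value, suggesting a good fit.
A high \(R^2\) can sometimes reflect overfitting when there are many
predictors relative to the sample size, although with four predictors,
1,194 observations, and an adjusted \(R^2\) of 0.9378 that risk is small
here.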
- Recommendation: We can consider using techniques
like cross-validation to ensure that the model generalizes well to new
data.
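The outlier and model-fit recommendations above can be sketched as follows. The example flags influential points with Cook's distance (using the common 4/n rule of thumb, which is a convention rather than a strict threshold) and estimates out-of-sample error with 5-fold cross-validation. It runs on a synthetic stand-in `sim` with one planted outlier; both are assumptions for illustration, not the Pokémon data.

```r
# Synthetic stand-in for pokemon_data (assumption), with one planted outlier
set.seed(7)
n <- 300
sim <- data.frame(
  SpAtk = runif(n, 20, 150), Attack = runif(n, 20, 150),
  SpDef = runif(n, 20, 150), HP = runif(n, 20, 150)
)
sim$Total <- 45 + 1.2 * sim$SpAtk + 1.6 * sim$Attack + 1.6 * sim$SpDef +
  0.9 * sim$HP + rnorm(n, sd = 30)
sim$Total[1] <- sim$Total[1] + 400   # planted outlier for illustration

model_formula <- Total ~ SpAtk + Attack + SpDef + HP
fit <- lm(model_formula, data = sim)

# Cook's distance: flag observations above the 4/n rule of thumb
cooks <- cooks.distance(fit)
flagged <- which(cooks > 4 / n)
cat("Flagged rows:", head(flagged, 10), "\n")

# 5-fold cross-validation: fit on four folds, measure RMSE on the held-out fold
k <- 5
fold <- sample(rep(1:k, length.out = n))
fold_rmse <- sapply(1:k, function(i) {
  train <- sim[fold != i, ]
  test  <- sim[fold == i, ]
  fold_fit <- lm(model_formula, data = train)
  sqrt(mean((test$Total - predict(fold_fit, newdata = test))^2))
})
cv_rmse <- mean(fold_rmse)
cat("In-sample residual SE:", round(summary(fit)$sigma, 2),
    " Cross-validated RMSE:", round(cv_rmse, 2), "\n")
```

If the cross-validated RMSE is close to the in-sample residual standard error, overfitting is unlikely to be a concern.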
Coefficient Interpretation:
1. Intercept: 44.9413
- The estimated “Total Stat” when all predictors (SpAtk, Attack,
SpDef, HP) are zero. Given the nature of the data, this value is mainly
theoretical, as it’s not realistic for these stats to be zero.
2. SpAtk: 1.1745
- For every unit increase in the SpAtk value, the “Total Stat” is
expected to increase by approximately 1.1745 units, keeping other
variables constant.
- This suggests that SpAtk is a significant contributor to a Pokémon’s
total stats.
3. Attack: 1.6370
- For every unit increase in the Attack value, the “Total Stat” is
expected to increase by approximately 1.6370 units.
- Among the predictors we’ve used, Attack has the largest coefficient,
so a one-point difference in Attack is associated with the largest
difference in the total stats.
4. SpDef: 1.5695
- For every unit increase in the SpDef value, the “Total Stat” is
expected to increase by approximately 1.5695 units.
- SpDef also has a substantial impact on the total stats, nearly as
much as Attack.
5. HP: 0.9163
- For every unit increase in HP, the “Total Stat” is expected to
increase by approximately 0.9163 units.
- While HP has the lowest coefficient among our predictors, it still
plays an essential role in determining the total stats.
Insights:
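- Both Attack and SpDef have strong positive relationships with the
“Total Stat”, with Attack slightly ahead (1.64 vs. 1.57 per point).
- A Pokémon’s special attack (SpAtk) and hit points (HP) also
positively influence its total stats, but to a lesser extent
(1.17 and 0.92 per point).
- If Total is the sum of a Pokémon’s individual stats, each stat adds
exactly one point per unit, so coefficients above 1 suggest that these
predictors also pick up the contribution of correlated stats left out of
the model. They are best read as associations rather than isolated
effects.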
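Raw coefficients compare effects per point, which is only a fair ranking when the predictors have similar spreads. One hedged way to check the ranking is to compute standardized coefficients (the raw coefficient times sd(x) / sd(y)). The sketch below does this on a synthetic stand-in `sim`, whose ranges and coefficients are assumptions; on the real data, use `pokemon_data` and `linear_model` instead.

```r
# Synthetic stand-in for pokemon_data (assumption); predictors are given
# different spreads on purpose so raw and standardized rankings can differ.
set.seed(3)
n <- 400
sim <- data.frame(
  SpAtk = runif(n, 10, 190), Attack = runif(n, 5, 180),
  SpDef = runif(n, 20, 130), HP = runif(n, 1, 250)
)
sim$Total <- 45 + 1.2 * sim$SpAtk + 1.6 * sim$Attack + 1.6 * sim$SpDef +
  0.9 * sim$HP + rnorm(n, sd = 30)

fit <- lm(Total ~ SpAtk + Attack + SpDef + HP, data = sim)
predictors <- c("SpAtk", "Attack", "SpDef", "HP")

# Standardized coefficient = raw coefficient * sd(predictor) / sd(response)
raw <- coef(fit)[predictors]
standardized <- raw * sapply(sim[predictors], sd) / sd(sim$Total)

# Cross-check: refitting on z-scored columns gives the same slopes
z <- as.data.frame(scale(sim))
fit_z <- lm(Total ~ SpAtk + Attack + SpDef + HP, data = z)

print(round(cbind(raw = raw, standardized = standardized), 3))
```

A standardized coefficient gives the expected change in Total, in standard deviations, for a one standard deviation change in the predictor.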
Significance:
- Understanding which stats have the most influence on a Pokémon’s
“Total Stat” can be crucial for players aiming to build a strong team.
For instance, prioritizing Pokémon with high Attack and SpDef can be a
strategy for those wanting to maximize their team’s overall
strength.
Further Questions:
- Are there interactions between these stats that further amplify the
total stat value?
- How do other variables, like Type1 or Type2, influence the total
stats? Could including them improve the model’s predictive power or
offer more nuanced insights?
- Are there specific Pokémon types that tend to have higher Attack or
SpDef values on average?
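One possible starting point for these questions is to compare nested models with an F-test and to summarize stats by type. The sketch below adds an Attack:SpDef interaction and a Type1 factor to the base model and compares each against it with `anova()`. It uses a synthetic stand-in `sim` with three made-up type labels; the data, the choice of interaction, and the type levels are all assumptions for illustration.

```r
# Synthetic stand-in for pokemon_data (assumption), including a Type1 factor
set.seed(11)
n <- 400
sim <- data.frame(
  SpAtk = runif(n, 20, 150), Attack = runif(n, 20, 150),
  SpDef = runif(n, 20, 150), HP = runif(n, 20, 150),
  Type1 = factor(sample(c("Fire", "Water", "Grass"), n, replace = TRUE))
)
sim$Total <- 45 + 1.2 * sim$SpAtk + 1.6 * sim$Attack + 1.6 * sim$SpDef +
  0.9 * sim$HP + rnorm(n, sd = 30)

base_fit  <- lm(Total ~ SpAtk + Attack + SpDef + HP, data = sim)
inter_fit <- update(base_fit, . ~ . + Attack:SpDef)  # adds one interaction term
type_fit  <- update(base_fit, . ~ . + Type1)         # adds type dummy variables

# F-tests of each extended model against the base model
inter_test <- anova(base_fit, inter_fit)
type_test  <- anova(base_fit, type_fit)
print(inter_test)
print(type_test)

# Average Attack and SpDef by primary type
type_means <- aggregate(cbind(Attack, SpDef) ~ Type1, data = sim, FUN = mean)
print(type_means)
```

A small p-value in either comparison would suggest that the added term explains variation the base model misses.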