Pokemon Project
Load the libraries
library(tidyverse)
library(ggfortify)
Brief Introduction
To start off, the data for this project comes from a public database called PokeAPI. It includes the names of Pokemon species along with characteristics such as their primary and secondary types, height, and weight. Out of curiosity, I want to know whether there is a relationship between a Pokemon’s height and its weight. This raises the question of whether taller Pokemon are usually heavier, and how strong that relationship is.
Source: PokeAPI (https://pokeapi.co/)
Loading and Cleaning Data
setwd("~/Data 110")
pokemon<-readr::read_csv("pokemon_data_pokeapi.csv")
# source: PokeAPI
Replace missing Type2 values with “No Secondary Type”
pokemon <- pokemon %>%
  mutate(Type2 = ifelse(is.na(Type2), "No Secondary Type", Type2))
Filter out rows with missing Height or Weight values
pokemon <- pokemon %>%
  filter(!is.na(`Height (m)`), !is.na(`Weight (kg)`))
Check the cleaned data summary
summary(pokemon)
     Name           Pokedex Number    Type1              Type2
 Length:905         Min.   :  1    Length:905         Length:905
 Class :character   1st Qu.:227    Class :character   Class :character
 Mode  :character   Median :453    Mode  :character   Mode  :character
                    Mean   :453
                    3rd Qu.:679
                    Max.   :905
 Classification       Height (m)      Weight (kg)      Abilities
 Length:905         Min.   : 0.100   Min.   :  0.10   Length:905
 Class :character   1st Qu.: 0.500   1st Qu.:  8.50   Class :character
 Mode  :character   Median : 1.000   Median : 28.00   Mode  :character
                    Mean   : 1.193   Mean   : 64.29
                    3rd Qu.: 1.500   3rd Qu.: 65.50
                    Max.   :20.000   Max.   :999.90
   Generation    Legendary Status
 Min.   :1.000   Length:905
 1st Qu.:2.000   Class :character
 Median :4.000   Mode  :character
 Mean   :4.177
 3rd Qu.:6.000
 Max.   :8.000
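As a quick check of the cleaning steps above, here is a minimal sketch on a small made-up tibble. The three rows and their values are hypothetical and are not taken from the PokeAPI file. It applies the same mutate()/filter() pattern and, for comparison, the tidyr helpers replace_na() and drop_na(), which should leave the same rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; values are made up for illustration only
toy <- tibble(
  Name          = c("A", "B", "C"),
  Type2         = c("Flying", NA, "Poison"),
  `Height (m)`  = c(1.0, 0.5, NA),
  `Weight (kg)` = c(10, 5, 20)
)

# Same pattern as in the project: fill Type2, then drop rows missing Height or Weight
cleaned <- toy %>%
  mutate(Type2 = ifelse(is.na(Type2), "No Secondary Type", Type2)) %>%
  filter(!is.na(`Height (m)`), !is.na(`Weight (kg)`))

# Alternative using the tidyr helpers
cleaned_alt <- toy %>%
  replace_na(list(Type2 = "No Secondary Type")) %>%
  drop_na(`Height (m)`, `Weight (kg)`)

print(cleaned)
print(cleaned_alt)
```

Row "C" is dropped in both versions because its height is missing, and row "B" keeps its place with the filled-in Type2 label.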
Scatterplot: Height vs Weight
# Scatterplot with regression line
ggplot(pokemon, aes(x = `Height (m)`, y = `Weight (kg)`, color = Type1)) +
  geom_point(alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  labs(
    title = "Pokemon Height vs Weight",
    subtitle = "Each point represents one Pokemon species",
    x = "Height (m)",
    y = "Weight (kg)",
    color = "Primary Type",
    caption = "Source: PokeAPI (https://pokeapi.co/)"
  ) +
  theme_minimal(base_size = 12)
`geom_smooth()` using formula = 'y ~ x'
Correlation Between Height and Weight
# Calculate correlation between Height and Weight
cor(pokemon$`Height (m)`, pokemon$`Weight (kg)`)
[1] 0.6424369
Linear Regression Analysis
#Simple linear regression: Weight predicted by Height
model<- lm(`Weight (kg)` ~ `Height (m)`, data = pokemon)
#Display summary results
summary(model)

Call:
lm(formula = `Weight (kg)` ~ `Height (m)`, data = pokemon)

Residuals:
    Min      1Q  Median      3Q     Max
-493.10  -26.74  -13.95   -1.88 1003.50

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    -9.818      4.229  -2.321   0.0205 *
`Height (m)`   62.133      2.466  25.191   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 91.42 on 903 degrees of freedom
Multiple R-squared:  0.4127,	Adjusted R-squared:  0.4121
F-statistic: 634.6 on 1 and 903 DF,  p-value: < 2.2e-16
Regression Equation
The regression gives the equation: Weight = -9.82 + 62.13(Height). For each 1-meter increase in a Pokemon’s height, its predicted weight increases by 62.13 kilograms on average. The intercept of -9.82 has no real-world meaning (a Pokemon’s height cannot be 0, and its weight cannot be negative!), but mathematically it is still part of the regression line and is needed to position it.
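To make the equation concrete, here is a small sketch that plugs a few heights into it. The coefficients are copied from the summary(model) output; the helper name predict_weight is my own and is not part of the project code.

```r
# Coefficients copied from the summary(model) output
intercept <- -9.818
slope     <- 62.133

# Hypothetical helper: predicted weight (kg) for a given height (m)
predict_weight <- function(height_m) intercept + slope * height_m

# Predictions for a 1 m and a 2 m Pokemon
predict_weight(c(1, 2))   # 52.315 and 114.448

# The difference between them is the slope
predict_weight(2) - predict_weight(1)

# For the shortest height in the data (0.1 m) the prediction is negative,
# which shows the line should not be trusted for very small Pokemon
predict_weight(0.1)
```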
Model Interpretation
Because the p-value for Height is below 2e-16, well under 0.05, Height is a statistically significant predictor of Weight. The R² value is 0.413 (adjusted R² = 0.412), meaning height explains about 41.3% of the variation in Pokemon weight. Together with the correlation of 0.64, this shows that Height and Weight have a positive, moderately strong relationship: taller Pokemon tend to be heavier, in agreement with what we would expect in real life.
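In a simple regression with one predictor, R² equals the square of the correlation, so the R² value can be cross-checked against the cor() result and the "Multiple R-squared" line in the summary output. The sketch below does that arithmetic and then confirms the identity on simulated data. The simulated numbers are made up and are not the Pokemon data.

```r
# Correlation reported by cor() in this report
r <- 0.6424369
r_squared <- r^2
round(r_squared, 4)   # 0.4127, matching "Multiple R-squared"

# Same identity on simulated data (made up, not the Pokemon data)
set.seed(1)
x <- runif(50, 0.1, 3)
y <- -10 + 60 * x + rnorm(50, sd = 40)
fit <- lm(y ~ x)
all.equal(cor(x, y)^2, summary(fit)$r.squared)   # TRUE
```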
Diagnostic Plot
autoplot(model, 1:4, nrow = 2, ncol = 2)
In the Residuals vs Fitted plot, the points should be scattered randomly around 0, which is a sign of a good fit. The Q-Q plot looks roughly like a straight line, which suggests that the residuals are approximately normally distributed. If both conditions hold, it is reasonable to say that the assumptions of the linear regression model are met, although the extreme residuals in the model summary (from -493.10 to 1003.50) suggest that a few unusual Pokemon are worth a closer look.
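The diagnostic plots can be backed up with a few numbers. The sketch below uses simulated data with one deliberately extreme point (all values are made up, not the Pokemon data) to show how base R's residuals() and cooks.distance() pick out the observations that stand out. As far as I know, Cook's distance is also the fourth panel requested by 1:4 in autoplot().

```r
# Simulated data (made up), with one deliberately extreme point
set.seed(42)
height <- runif(100, 0.1, 3)
weight <- -10 + 60 * height + rnorm(100, sd = 20)
weight[1] <- weight[1] + 900   # hypothetical outlier

fit <- lm(weight ~ height)
res <- residuals(fit)

# Summary comparable to the "Residuals:" block in summary(model)
summary(res)

# Row indices of the three largest absolute residuals
head(order(abs(res), decreasing = TRUE), 3)

# Cook's distance flags points with a large influence on the fitted line
which.max(cooks.distance(fit))
```

On the real data, the same calls on model would show which Pokemon are behind the largest residuals.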
Conclusion/Reflection
In this project, I wanted to see whether a Pokemon’s height is related to its weight. Based on both the scatterplot and the regression, the answer is yes: the two have a positive and moderately strong relationship. For the data cleaning, I replaced missing Type2 values with “No Secondary Type” using mutate(), and filtered out rows with missing Height or Weight values using filter(!is.na(`Height (m)`), !is.na(`Weight (kg)`)). I used the cleaning approach from the course rather than na.omit() or the tidyverse’s drop_na(). Finally, if I had more time, I would check whether different Pokemon types (Grass, Water, Fire, etc.) make a difference in the Height-Weight scatterplots, or try using the Plotly package, which we learned in class.