For this assessment you will use a dataset about Pokemon. I wrangled these data from two different sources on Kaggle:
pokemon<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA151/main/Data/pokemonMid2.csv")
For decades, kids all over the world have been discovering the enchanting world of Pokémon (an abbreviation for Pocket Monsters). Many of those children become lifelong fans. Today, the Pokémon family of products includes video games, the Pokémon Trading Card Game (TCG), an animated series, movies, toys, books, and much more.
Pokémon are creatures of all shapes and sizes who live in the wild or alongside their human partners (called “Trainers”). During their adventures, Pokémon grow and become more experienced and even, on occasion, evolve into stronger Pokémon. Hundreds of known Pokémon inhabit the Pokémon universe, with untold numbers waiting to be discovered!
Learn about the variables available:
str(pokemon)
## 'data.frame': 751 obs. of 17 variables:
## $ PokeDexNumber : int 1 2 3 4 5 6 7 8 9 10 ...
## $ PokeName : chr "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
## $ Type : chr "Grass" "Grass" "Grass" "Fire" ...
## $ OtherType : chr "Poison" "Poison" "Poison" "" ...
## $ SumOfAttack : int 318 405 525 309 405 534 314 405 530 195 ...
## $ HitPoints : int 45 60 80 39 58 78 44 59 79 45 ...
## $ AttStrg : int 49 62 82 52 64 84 48 63 83 30 ...
## $ DefStrg : int 49 63 83 43 58 78 65 80 100 35 ...
## $ SpAttStrg : int 65 80 100 60 80 109 50 65 85 20 ...
## $ SpDefStrg : int 65 80 100 50 65 85 64 80 105 20 ...
## $ Speed : int 45 60 80 65 80 100 43 58 78 45 ...
## $ Generation : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Legendary : chr "False" "False" "False" "False" ...
## $ capture_rate : int 45 45 45 45 45 45 45 45 45 255 ...
## $ height_m : num 0.7 1 2 0.6 1.1 1.7 0.5 1 1.6 0.3 ...
## $ weight_kg : num 6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
## $ percentage_male: num 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 50 ...
When Professor Smalley was a child, she used to collect Pokémon cards. In the first generation of Pokémon there were “150 or more to see”.
Question: (4 points) Which generation introduced the most new Pokémon and how many were in that generation?
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
pokemon%>%
count(Generation)%>%
arrange(desc(n))
## Generation n
## 1 5 164
## 2 1 151
## 3 3 140
## 4 4 116
## 5 2 99
## 6 6 81
Look up the PokéRap after class if you’ve never heard the song.
Question: (4 points) Looking only at the main type, how many different types of Pokémon are there? Which type has the most species?
species<-pokemon%>%
count(Type)%>%
arrange(desc(n))
## HOW MANY TYPES
dim(species)
## [1] 18 2
## COUNT SPECIES BY TYPE
head(species)
## Type n
## 1 Water 107
## 2 Normal 94
## 3 Grass 66
## 4 Bug 65
## 5 Psychic 52
## 6 Fire 48
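As a cross-check on the count above, the number of distinct types can also be read off directly with n_distinct(), and count() can do the sorting itself with sort = TRUE. The sketch below uses a small made-up data frame (the name toy and its rows are invented for illustration), so the numbers are not the real pokemon results.

```r
library(dplyr)

# Hypothetical toy data: six invented rows, not the real pokemon data
toy <- data.frame(Type = c("Water", "Water", "Grass", "Fire", "Grass", "Water"))

# Number of distinct main types
n_types <- n_distinct(toy$Type)

# Species per type, largest group first
by_type <- toy %>%
  count(Type, sort = TRUE)

n_types
by_type
```

The same two calls should work unchanged on pokemon, since they rely only on the Type column.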
Each species of Pokémon has a catch rate that applies to all its members. When a Poké Ball is thrown at a wild Pokémon, the game uses that Pokémon’s catch rate in a formula to determine the chances of catching that Pokémon. Higher catch rates mean that the Pokémon is easier to catch, up to a maximum of 255.
Source: https://bulbapedia.bulbagarden.net/wiki/Catch_rate
Question: (4 points) Create a histogram plot for the distribution of capture_rate. Test different amounts of bins. What number of bins best illustrates the shape of the data?
NOTE: There is not one right answer; this is about articulating your point and providing support for your choice.
ggplot(data=pokemon, aes(x=capture_rate))+
geom_histogram(bins=7)
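One way to "test different amounts of bins" is to build the same histogram for several candidate values and compare them side by side. The following is a minimal sketch under stated assumptions: the data frame toy holds invented capture rates, and the candidate values in bin_options are an arbitrary choice for illustration.

```r
library(ggplot2)

# Hypothetical toy data: invented capture rates, not the real pokemon data
toy <- data.frame(
  capture_rate = c(3, 3, 45, 45, 45, 60, 75, 90, 120, 190, 190, 255, 255)
)

# Candidate bin counts to compare (arbitrary values for illustration)
bin_options <- c(5, 10, 30)

# Build one histogram per candidate bin count
plots <- lapply(bin_options, function(b) {
  ggplot(data = toy, aes(x = capture_rate)) +
    geom_histogram(bins = b) +
    ggtitle(paste(b, "bins"))
})

# Display each plot in turn
for (p in plots) print(p)
```

Too few bins can hide peaks and gaps, while too many can make the plot noisy, so the supporting argument usually comes from comparing several choices like this.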
Question: (4 points) Create a new data frame to show which Pokemon type has the fastest average speed.
State which Type has the fastest average speed and what that speed is.
fastest<-pokemon%>%
group_by(Type)%>%
summarise(avgSpeed=mean(Speed, na.rm=TRUE))%>%
arrange(desc(avgSpeed))
## TOP 6
head(fastest)
## # A tibble: 6 × 2
## Type avgSpeed
## <chr> <dbl>
## 1 Flying 102.
## 2 Electric 84.2
## 3 Dragon 78.1
## 4 Psychic 77.2
## 5 Dark 75.4
## 6 Fire 74.0
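Since the question asks for a single type and its speed, the top row can also be pulled out directly instead of reading it off head(). A minimal sketch, assuming dplyr 1.0.0 or later for slice_max(); the data frame toy and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data: invented speeds, not the real pokemon data
toy <- data.frame(
  Type  = c("Fire", "Fire", "Water", "Water", "Flying"),
  Speed = c(65, 100, 43, 78, 110)
)

# Average speed per type, then keep only the fastest type
top_type <- toy %>%
  group_by(Type) %>%
  summarise(avgSpeed = mean(Speed, na.rm = TRUE)) %>%
  slice_max(avgSpeed, n = 1)

top_type
```

By default slice_max() keeps ties, so more than one row can come back if two types share the highest average.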
Legendary Pokémon are a group of incredibly rare and often very powerful Pokémon, generally featured prominently in the legends and myths of the Pokémon world.
Source: https://bulbapedia.bulbagarden.net/wiki/Legendary_Pok%C3%A9mon
Question: (4 points) Create a new dataframe to summarise the following information for each Legendary status group: the number of Pokémon, the average speed, the average hit points, and the average capture rate.
State your observations.
legendary<-pokemon%>%
group_by(Legendary)%>%
summarise(n=n(),
avgSpeed=mean(Speed, na.rm = TRUE),
avgHP=mean(HitPoints, na.rm=TRUE),
avgCR=mean(capture_rate, na.rm=TRUE))
## Legendary
legendary
## # A tibble: 2 × 5
## Legendary n avgSpeed avgHP avgCR
## <chr> <int> <dbl> <dbl> <dbl>
## 1 False 692 63.9 66.6 105.
## 2 True 59 98.3 93.2 6.56
Question: (4 points) Create a side-by-side boxplot to show the distribution of speeds by Legendary. Use the proper aesthetic so that each box contains the color for the Legendary status.
State your observations.
ggplot(data=pokemon, aes(x=Legendary, y=Speed, fill=Legendary))+
geom_boxplot()
Question: (4 points) Now, just looking at water type Pokémon, describe the shape of the distribution for the sum of attack points (`SumOfAttack`). Use the most appropriate geometries to create plot(s).
Be sure to include commentary on symmetry/skew, modality, spread, and outliers.
water<-pokemon%>%
filter(Type=="Water")
## BIMODAL SHAPE
ggplot(data=water, aes(x=SumOfAttack))+
geom_density()
## OUTLIERS?
ggplot(data=water, aes(x=SumOfAttack))+
geom_boxplot()
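To back up the commentary on spread and outliers with numbers, one option is to compute the median, the IQR, and the 1.5 × IQR fences that geom_boxplot() uses by default to flag outlying points. The sketch below runs on invented values (the vector sum_attack is hypothetical), so the results are illustrative only.

```r
# Hypothetical toy values standing in for SumOfAttack (invented)
sum_attack <- c(200, 300, 310, 320, 330, 340, 500, 510, 520, 530, 900)

# Center and spread that are robust to skew
center <- median(sum_attack)
q      <- quantile(sum_attack, probs = c(0.25, 0.75), names = FALSE)
spread <- IQR(sum_attack)

# Fences for the 1.5 * IQR outlier rule
lower_fence <- q[1] - 1.5 * spread
upper_fence <- q[2] + 1.5 * spread

# Values falling outside the fences
outliers <- sum_attack[sum_attack < lower_fence | sum_attack > upper_fence]

center
spread
outliers
```

Applied to water$SumOfAttack, the same steps would give numeric support for whatever the density plot and boxplot suggest.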
Question: (4 points) Please convert the values for height and weight to the imperial system, which is standard in the United States.
Create a new data frame to accomplish this so that we can use these values in subsequent Parts 8, 9, and 10.
imperial<-pokemon%>%
mutate(height_in=height_m*39.3701,
weight_lb =weight_kg*2.205)
str(imperial)
## 'data.frame': 751 obs. of 19 variables:
## $ PokeDexNumber : int 1 2 3 4 5 6 7 8 9 10 ...
## $ PokeName : chr "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
## $ Type : chr "Grass" "Grass" "Grass" "Fire" ...
## $ OtherType : chr "Poison" "Poison" "Poison" "" ...
## $ SumOfAttack : int 318 405 525 309 405 534 314 405 530 195 ...
## $ HitPoints : int 45 60 80 39 58 78 44 59 79 45 ...
## $ AttStrg : int 49 62 82 52 64 84 48 63 83 30 ...
## $ DefStrg : int 49 63 83 43 58 78 65 80 100 35 ...
## $ SpAttStrg : int 65 80 100 60 80 109 50 65 85 20 ...
## $ SpDefStrg : int 65 80 100 50 65 85 64 80 105 20 ...
## $ Speed : int 45 60 80 65 80 100 43 58 78 45 ...
## $ Generation : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Legendary : chr "False" "False" "False" "False" ...
## $ capture_rate : int 45 45 45 45 45 45 45 45 45 255 ...
## $ height_m : num 0.7 1 2 0.6 1.1 1.7 0.5 1 1.6 0.3 ...
## $ weight_kg : num 6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
## $ percentage_male: num 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 50 ...
## $ height_in : num 27.6 39.4 78.7 23.6 43.3 ...
## $ weight_lb : num 15.2 28.7 220.5 18.7 41.9 ...
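A quick sanity check of the conversion arithmetic on the first three Pokémon shown in the output above. The constant names IN_PER_M and LB_PER_KG are introduced here for illustration; 2.205 is the rounded factor used in the solution, and a more precise value is 2.20462 pounds per kilogram.

```r
# Conversion factors (names are hypothetical; values match the solution code)
IN_PER_M  <- 39.3701  # inches per meter
LB_PER_KG <- 2.205    # pounds per kilogram, rounded

# First three heights and weights from the str() output
height_m  <- c(0.7, 1, 2)
weight_kg <- c(6.9, 13, 100)

# Convert to imperial units
height_in <- height_m * IN_PER_M
weight_lb <- weight_kg * LB_PER_KG

height_in
weight_lb
```

The results should agree with the height_in and weight_lb columns in the str() output up to rounding.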
Question: (4 points) It appears that the distributions for height and weight are skewed. When there is skew in your data, what metric should be used for center? Support your answer.
Using this metric, summarise your data to find the center value for the height and weight for each Type.
HINT: CAUTION WITH NA’s.
imperial%>%
group_by(Type)%>%
summarise(medHeight=median(height_in, na.rm=TRUE),
medWeight=median(weight_lb, na.rm=TRUE))
## # A tibble: 18 × 3
## Type medHeight medWeight
## <chr> <dbl> <dbl>
## 1 Bug 31.5 32.0
## 2 Dark 39.4 63.9
## 3 Dragon 70.9 218.
## 4 Electric 23.6 33.5
## 5 Fairy 23.6 16.5
## 6 Fighting 47.2 88.2
## 7 Fire 39.4 84.3
## 8 Flying 59.1 139.
## 9 Ghost 41.3 33.1
## 10 Grass 31.5 32.0
## 11 Ground 43.3 150.
## 12 Ice 43.3 122.
## 13 Normal 37.4 54.4
## 14 Poison 43.3 59.5
## 15 Psychic 33.5 41.3
## 16 Rock 47.2 132.
## 17 Steel 35.4 133.
## 18 Water 39.4 62.8
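The reason the median is preferred under skew, and the reason for the NA caution, can both be seen in a tiny example. The weights below are invented: one extreme value pulls the mean far above the typical weight, while the median stays put, and a single NA makes the result NA unless na.rm = TRUE is set.

```r
# Hypothetical weights (invented): one extreme value and one missing value
weights <- c(10, 12, 15, 18, 400, NA)

# Without na.rm, any NA makes the result NA
mean_with_na <- mean(weights)

# With na.rm = TRUE the NA is dropped before computing
mean_clean   <- mean(weights, na.rm = TRUE)    # pulled upward by the value 400
median_clean <- median(weights, na.rm = TRUE)  # stays at the middle value

mean_with_na
mean_clean
median_clean
```

Here the mean is 91 while the median is 15, which is much closer to what a typical value in this vector looks like.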
Question: (4 points) Create a new column for the height (in) / weight (lbs) ratio for each Pokémon. Then find the Pokémon type with the highest average height-weight ratio.
HINT: CAUTION WITH NA’s.
ratio<-imperial%>%
mutate(ratio=height_in/weight_lb)%>%
group_by(Type)%>%
summarise(avgRatio=mean(ratio, na.rm=TRUE))
## FIRST 6 ROWS (SORTED ALPHABETICALLY BY TYPE, NOT BY RATIO)
head(ratio)
## # A tibble: 6 × 2
## Type avgRatio
## <chr> <dbl>
## 1 Bug 1.25
## 2 Dark 0.818
## 3 Dragon 0.955
## 4 Electric 3.61
## 5 Fairy 3.00
## 6 Fighting 0.553
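The question asks for the type with the highest average ratio, which is easiest to read off once the summary is sorted in descending order. A minimal sketch on invented data (the data frame toy and its values are hypothetical), including one missing weight to show the effect of na.rm = TRUE:

```r
library(dplyr)

# Hypothetical toy data: invented heights and weights, one weight missing
toy <- data.frame(
  Type      = c("Bug", "Bug", "Rock", "Rock", "Fairy"),
  height_in = c(12, 20, 40, 60, 10),
  weight_lb = c(6, 10, 400, NA, 2)
)

# Ratio per Pokémon, average per type, highest average first
ranked <- toy %>%
  mutate(ratio = height_in / weight_lb) %>%
  group_by(Type) %>%
  summarise(avgRatio = mean(ratio, na.rm = TRUE)) %>%
  arrange(desc(avgRatio))

ranked
```

Adding the same arrange(desc(avgRatio)) step to the pipeline on imperial would put the type with the highest average ratio in the first row.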