Exploring Pokemon dataset:

Finding more details about our dataset:

Correlations
Normality tests

Finding patterns:

I’ll be examining the following hypotheses:

Hypothesis Q1: Are Legendary Pokemons better than non-Legendary ones?
Hypothesis Q2: Did Pokemons get better with generations?
Hypothesis Q3: Is bigger always better? Does size matter for Pokemons?
Hypothesis Q4: Are bigger and heavier Pokemons harder to capture?
Hypothesis Q5: Are Fighting Pokemons better in attack and defence?

Setting working directory

getwd()

## [1] "X:/1.Study/4th year semester 2/Biostat/Assignment 2"

setwd("X:/1.Study/4th year semester 2/Biostat/Assignment 2")
getwd()

## [1] "X:/1.Study/4th year semester 2/Biostat/Assignment 2"

Importing pokemon dataset

pokemon<- read.csv("pokemon.csv", header = TRUE)

Importing libraries

library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(GGally)
library(tidyr)
library(ggthemes)
library(gridExtra)
library(cowplot)

Basic Properties of Pokemon dataset

Checking properties of Pokemon dataset

dim(pokemon)

## [1] 801  41

So there are 801 rows and 41 columns.

Checking other properties

str(pokemon)

There are many numeric variables in the dataset. Correlation can only be computed between numeric variables, so we will select the numeric variables only.

poknum<- select_if(pokemon, is.numeric)
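As a minimal sketch of what this selection does, the toy data frame below (made-up values, not taken from pokemon.csv) keeps only its numeric columns using base R. The names toy and toy_num are only for illustration.

```r
# Toy data frame with mixed column types (values are made up for illustration)
toy <- data.frame(
  name   = c("A", "B", "C"),
  attack = c(49, 62, 82),
  speed  = c(45, 60, 80),
  type1  = c("grass", "grass", "fire"),
  stringsAsFactors = FALSE
)

# Keep only the columns for which is.numeric() is TRUE
toy_num <- toy[, sapply(toy, is.numeric), drop = FALSE]
print(names(toy_num))
```

Character columns such as name and type1 are dropped, so the result can be passed to correlation functions directly.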

Correlation and Distribution of Variables

ggpairs(pokemon, columns = c('attack', 'defense', 'hp', 'sp_attack', 'sp_defense', 'speed'), col="red") +
  theme_bw() +
  labs(title = 'Correlation Matrix of Pokemon Stats')

Number of pokemons by type

type1num<- pokemon %>%
  group_by(type1) %>%
  summarise(number = n()) %>%
  ggplot(aes(x = reorder(type1, number), y = number , fill = type1)) +
  geom_bar(stat = 'identity') +
  xlab(label = "Type of Pokemon") +
  ylab(label = "Number of Pokemon") +
  ggtitle(label = "Number of Pokemon Type 1") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(legend.position="none") +
  coord_flip() +
  geom_text(aes(label = number), hjust = -1.0)

## `summarise()` ungrouping output (override with `.groups` argument)

type2num<- pokemon %>%
  group_by(type2) %>%
  summarise(number = n()) %>%
  ggplot(aes(x = reorder(type2, number), y = number , fill = type2)) +
  geom_bar(stat = 'identity') +
  xlab(label = "Type of Pokemon") +
  ylab(label = "Number of Pokemon") +
  ggtitle(label = "Number of Pokemon Type 2") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(legend.position="none") +
  coord_flip() +
  geom_text(aes(label = number), hjust = -1.0)

## `summarise()` ungrouping output (override with `.groups` argument)

plot_grid(type1num, type2num, labels=c("A", "B"), ncol = 2, nrow = 1)

Hypothesis 1:

Are Legendary Pokemons better than Non-Legendary ones?

As the name suggests, Legendary Pokemons should be more powerful than the non legendary ones, so we will check this hypothesis in this assignment.

Extracting Legendary and Non legendary pokemons

is_legendary<- pokemon[pokemon$is_legendary==1, ]
not_legendary<- pokemon[pokemon$is_legendary==0, ]
legendary <- data.frame(
  Type = c("Legendary", "Not Legendary"),
  Numbers = c(nrow(is_legendary), nrow(not_legendary))
  )
head(legendary)

##            Type Numbers
## 1     Legendary      70
## 2 Not Legendary     731

Plotting Pie chart

ggplot(legendary, aes(x="", y= Numbers, fill= Type))+ 
  geom_bar(width = 1, stat = "identity")+
  coord_polar("y", start=0)+
  scale_fill_brewer(palette="Blues")+
  theme_minimal()

The pie chart above shows that Non legendary Pokemons make up most of the Pokemon population (731 of 801).

Checking Base total of Legendary vs non Legendary Pokemons

If Legendary pokemons are really better, their base total must be higher than that of the non legendary ones. Base total is the sum of the base stats (hp, attack, defense, sp_attack, sp_defense and speed), so it is a good single parameter to judge the power of a pokemon.
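A small sketch of how a base total can be built from individual stats, assuming it is computed from the six stat columns used in the correlation matrix above. The numbers below are made up for illustration and the data frame stats is hypothetical.

```r
# Hypothetical stats for two Pokemon (made-up numbers)
stats <- data.frame(
  hp = c(45, 106), attack = c(49, 110), defense = c(49, 90),
  sp_attack = c(65, 154), sp_defense = c(65, 90), speed = c(45, 130)
)

# Base total as the row-wise sum of the six base stats
stat_cols <- c("hp", "attack", "defense", "sp_attack", "sp_defense", "speed")
stats$base_total <- rowSums(stats[, stat_cols])
print(stats$base_total)
```

The same rowSums() call on the real columns can be compared with the base_total column to confirm how it was built.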

Hypothesis1: Box Plot

boxplot(pokemon$base_total~pokemon$is_legendary,main="Base total of Legendary vs Non Legendary Pokemons", names = c("Not Legendary","Legendary"), col = c("#A6B91A", "#A6B91A"), xlab = "Is_legendary", ylab = "Base Total")

stripchart(pokemon$base_total~pokemon$is_legendary, vertical = TRUE, method = "jitter", add = TRUE, pch = 20, col = c("#077eff", "#077eff"))

From the above boxplot it seems that the base total of Legendary pokemons is higher than that of the non-legendary ones.

To proceed further we first have to check normality with the following methods.

Hypothesis1: Normality test by Histogram

par(mfrow=c(1,3))
hist(is_legendary$base_total, main = "Legendary Pokemons", xlab = "Base Total of Legendary pokemons", col= "#8BC3B6")
hist(not_legendary$base_total, main = "Non Legendary Pokemons", xlab = "Base Total of Non Legendary pokemons", col= "#8BC3B6")
hist(rnorm(5000,mean=500,sd=50), main = "Normal control", xlab = "Normal function", col="#A6B91A")

From the above histograms, the base total data for Legendary and Non legendary pokemons do not seem to be normally distributed.

Hypothesis1: Normality test by qqplot

par(mfrow=c(1,3))
qqnorm(is_legendary$base_total, col = "#8BC3B6", main = "Legendary Pokemon")
qqline(is_legendary$base_total)

qqnorm(not_legendary$base_total, col="#8BC3B6", main = "Non Legendary")
qqline(not_legendary$base_total)

ctrl <- rnorm(5000,mean=500,sd=50) # draw the control sample once so the points and the line use the same data
qqnorm(ctrl, col="#A6B91A", main = "Normal Control")
qqline(ctrl)

From the above qqplots, the base total data for Legendary and Non legendary pokemons do not seem to be normally distributed.

Hypothesis1: Normality test by Shapiro test

Null Hypothesis: Dataset is Normally Distributed

shapiro.test(is_legendary$base_total)

## 
##  Shapiro-Wilk normality test
## 
## data:  is_legendary$base_total
## W = 0.76868, p-value = 3.834e-09

The P value is below 0.05, so it is significant and we reject the Null Hypothesis: the base total of Legendary pokemons isn’t normally distributed.

shapiro.test(not_legendary$base_total)

## 
##  Shapiro-Wilk normality test
## 
## data:  not_legendary$base_total
## W = 0.97442, p-value = 5.301e-10

The P value is below 0.05, so it is significant and we reject the Null Hypothesis: the base total of Non Legendary pokemons isn’t normally distributed.
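A minimal sketch of how to read the Shapiro test, using simulated samples (not the Pokemon data). The null hypothesis is normality, so a P value below 0.05 means normality is rejected. The names skewed and bell are only for illustration.

```r
set.seed(1)

# Simulated samples, not Pokemon data
skewed <- rexp(200, rate = 1)              # clearly non-normal
bell   <- rnorm(200, mean = 500, sd = 50)  # drawn from a normal distribution

p_skewed <- shapiro.test(skewed)$p.value
p_bell   <- shapiro.test(bell)$p.value

# p < 0.05: reject the null hypothesis that the sample is normally distributed
print(p_skewed < 0.05)
print(p_bell)
```

A large P value does not prove normality; it only means the test found no evidence against it.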

Hypothesis 1: Testing

Since the data are not normally distributed, we use a non-parametric test. The boxplot above suggests that the base total of Legendary pokemons is greater than that of Non legendary pokemons, so we test this in one direction.

Wilcoxon test

Null Hypothesis: Base total of Legendary pokemons is not greater than that of Non legendary pokemons.
Alternate Hypothesis: Base total of Legendary pokemons is greater.

wilcox.test(is_legendary$base_total, not_legendary$base_total, alternative = "greater")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  is_legendary$base_total and not_legendary$base_total
## W = 48475, p-value < 2.2e-16
## alternative hypothesis: true location shift is greater than 0

As we can see, the P value is very small, so we can reject the Null Hypothesis.
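A minimal sketch of a one-sided Wilcoxon rank sum test on simulated groups (made-up numbers, not the Pokemon data). With alternative = "greater" the test asks whether the first sample is shifted above the second one. The names legend_sim and other_sim are hypothetical.

```r
set.seed(2)

# Two simulated groups; the first one is shifted upwards on purpose
legend_sim <- rnorm(70,  mean = 600, sd = 60)
other_sim  <- rnorm(700, mean = 420, sd = 100)

# One-sided test: is legend_sim shifted above other_sim?
res <- wilcox.test(legend_sim, other_sim, alternative = "greater")
print(res$p.value)

# The difference in medians gives a feel for the size of the shift
print(median(legend_sim) - median(other_sim))
```

The order of the two samples matters for a one-sided test; swapping them would test the opposite direction.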

Conclusion

By the above tests we can say that Legendary pokemons are better than non legendary pokemons in terms of base total.

Hypothesis 2:

Do Pokemons improve with generations?

Test 1: Hypothesis 2

Pokemons distribution by generation

gen1<- poknum[poknum$generation== 1,]
gen2<- poknum[poknum$generation== 2,]
gen3<- poknum[poknum$generation== 3,]
gen4<- poknum[poknum$generation== 4,]
gen5<- poknum[poknum$generation== 5,]
gen6<- poknum[poknum$generation== 6,]
gen7<- poknum[poknum$generation== 7,]
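As an alternative sketch, split() can create all per-generation subsets in one step instead of seven separate lines. The toy data frame below is made up for illustration and only has three generations.

```r
# Toy data (made-up values): a generation column and a base_total column
toy <- data.frame(
  generation = c(1, 1, 2, 3, 3, 3),
  base_total = c(318, 405, 525, 310, 530, 600)
)

# One data frame per generation, stored in a named list
by_gen <- split(toy, toy$generation)

# Number of rows in each generation
counts <- sapply(by_gen, nrow)
print(counts)
```

Each subset is then available by name, for example by_gen[["1"]] for generation 1.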

Pie chart for Generations

Generations <- data.frame(
  Generation = c("Generation 1", "Generation 2", "Generation 3","Generation 4", "Generation 5", "Generation 6", "Generation 7"),
  Numbers = c(nrow(gen1), nrow(gen2), nrow(gen3), nrow(gen4), nrow(gen5), nrow(gen6), nrow(gen7))
  )


ggplot(Generations, aes(x="", y= Numbers, fill= Generation))+ 
  geom_bar(width = 1, stat = "identity")+
  coord_polar("y", start=0)+
  scale_fill_brewer(palette="Blues")+
  theme_minimal()

Are legendary pokemons increasing with generations?

As we have seen above, Legendary pokemons are better than non legendary ones, so if Pokemons are getting better with generations we should also see more legendary pokemons in later generations.

pokemon_edit <- pokemon # Just so we dont mess up dataset by mistake
pokemon_edit$is_legendary<-recode(pokemon_edit$is_legendary, "0" = "No", "1" ="Yes")

ggplot(pokemon_edit, aes(x = generation, fill = is_legendary)) +
  geom_bar() + ggtitle("Pokemon Generation wise Distribution frequency")

Boxplot

par(mfrow=c(2,2))
boxplot(pokemon$base_total~pokemon$generation, main="Base total vs Generation", xlab = "Generation", ylab = "Basetotal", col= "#8BC3B6")
boxplot(pokemon$attack~pokemon$generation, main="Attack vs Generation", xlab = "Generation", ylab = "Attack", col= "#8BC3B6")
boxplot(pokemon$defense~pokemon$generation, main="Defence vs Generation", xlab = "Generation", ylab = "Defence", col= "#8BC3B6")
boxplot(pokemon$weight_kg~pokemon$generation, main="Weight vs Generation", xlab = "Generation", ylab = "Weight", col= "#8BC3B6")

I don’t see any clear difference between the generations in these boxplots, so to proceed with our hypothesis we will take base total as our measuring criterion.

boxplot(pokemon$base_total~pokemon$generation, main="Base total vs Generation", xlab = "Generation", ylab = "Basetotal", col= "#8BC3B6")

stripchart(pokemon$base_total~pokemon$generation, vertical = TRUE, method = "jitter", add = TRUE, pch = 20, col = c("#A6B91A","#705746","#6F35FC","#F7D02C","#D685AD","#C22E28",
           "#EE8130"))

Before testing our hypothesis, we will first check normality for the above.

Hypothesis2: Normality test by Histogram

par(mfrow=c(3,3))
hist(gen1$base_total, main = "Generation 1", xlab = "Base Total", col="#8BC3B6")
hist(gen2$base_total, main = "Generation 2", xlab = "Base Total", col="#8BC3B6")
hist(gen3$base_total, main = "Generation 3", xlab = "Base Total", col="#8BC3B6")
hist(gen4$base_total, main = "Generation 4", xlab = "Base Total", col="#8BC3B6")
hist(gen5$base_total, main = "Generation 5", xlab = "Base Total", col="#8BC3B6")
hist(gen6$base_total, main = "Generation 6", xlab = "Base Total", col="#8BC3B6")
hist(gen7$base_total, main = "Generation 7", xlab = "Base Total", col="#8BC3B6")
hist(rnorm(5000,mean=500,sd=50), main = "Normal control", xlab = "Normal function", col="#A6B91A")

Hypothesis2: Normality test by qqplot

par(mfrow=c(3,3))
qqnorm(gen1$base_total, col = "#8BC3B6", main = "Generation 1" )
qqline(gen1$base_total)

qqnorm(gen2$base_total, col="#8BC3B6", main = "Generation 2")
qqline(gen2$base_total)

qqnorm(gen3$base_total, col = "#8BC3B6", main = "Generation 3")
qqline(gen3$base_total)

qqnorm(gen4$base_total, col="#8BC3B6", main = "Generation 4")
qqline(gen4$base_total)

qqnorm(gen5$base_total, col = "#8BC3B6", main = "Generation 5")
qqline(gen5$base_total)

qqnorm(gen6$base_total, col="#8BC3B6", main = "Generation 6")
qqline(gen6$base_total)

qqnorm(gen7$base_total, col = "#8BC3B6", main = "Generation 7")
qqline(gen7$base_total)

ctrl <- rnorm(5000,mean=500,sd=50) # draw the control sample once so the points and the line use the same data
qqnorm(ctrl, col="#A6B91A", main = "Normal Control")
qqline(ctrl)

Hypothesis2: Normality test by Shapiro test

shapiro.test(gen1$base_total)

## 
##  Shapiro-Wilk normality test
## 
## data:  gen1$base_total
## W = 0.97494, p-value = 0.007344

shapiro.test(gen2$base_total)

## 
##  Shapiro-Wilk normality test
## 
## data:  gen2$base_total
## W = 0.98092, p-value = 0.1566

shapiro.test(gen3$base_total)

## 
##  Shapiro-Wilk normality test
## 
## data:  gen3$base_total
## W = 0.96794, p-value = 0.002843

shapiro.test(gen4$base_total)

## 
##  Shapiro-Wilk normality test
## 
## data:  gen4$base_total
## W = 0.96918, p-value = 0.01369

shapiro.test(gen5$base_total)

## 
##  Shapiro-Wilk normality test
## 
## data:  gen5$base_total
## W = 0.94298, p-value = 6.067e-06

shapiro.test(gen6$base_total)

## 
##  Shapiro-Wilk normality test
## 
## data:  gen6$base_total
## W = 0.96647, p-value = 0.05173

shapiro.test(gen7$base_total)

## 
##  Shapiro-Wilk normality test
## 
## data:  gen7$base_total
## W = 0.95538, p-value = 0.007093
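The seven Shapiro tests above can also be run in one call. This is a minimal sketch on simulated data (made-up numbers, not the Pokemon dataset); sim and p_values are names used only for illustration.

```r
set.seed(3)

# Simulated data: 7 generations with 60 values each
sim <- data.frame(
  generation = rep(1:7, each = 60),
  base_total = round(rnorm(7 * 60, mean = 430, sd = 110))
)

# Run shapiro.test() once per generation and keep only the P value
p_values <- tapply(sim$base_total, sim$generation,
                   function(x) shapiro.test(x)$p.value)
print(round(p_values, 4))

# TRUE marks generations where normality is rejected at the 5% level
print(p_values < 0.05)
```

In the outputs above, generations 2 and 6 have P values above 0.05, so normality is not rejected for them, while it is rejected for the other five generations.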

Hypothesis 2: Testing

Several generations fail the Shapiro test, so we use a non-parametric test to check whether the base total of all 7 generations is the same or not.

Kruskal test

Null Hypothesis: Base total of Pokemons is the same in all generations.
Alternate Hypothesis: Base total of Pokemons differs in at least one generation.

kruskal.test(pokemon$generation~pokemon$base_total)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  pokemon$generation by pokemon$base_total
## Kruskal-Wallis chi-squared = 274.83, df = 202, p-value = 0.0004941

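The P value is below 0.05, so this test rejects its Null Hypothesis. However, the formula above has generation as the response and base total as the grouping variable (hence df = 202), so it does not compare base total between the 7 generations. That comparison needs kruskal.test(base_total ~ generation, data = pokemon), which has df = 6.

Conclusion

The boxplots show no clear difference in base total between generations, so Pokemons do not appear to be getting better with generations. The Kruskal test as run above neither confirms nor contradicts this.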
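A minimal sketch of a Kruskal test with base total as the response and generation as the grouping variable, on simulated data (the numbers are made up and are not results for the Pokemon dataset). With 7 groups the test has 6 degrees of freedom.

```r
set.seed(4)

# Simulated data: 7 generations with 50 values each
sim <- data.frame(
  generation = factor(rep(1:7, each = 50)),
  base_total = rnorm(7 * 50, mean = 430, sd = 110)
)

# Response on the left of the formula, grouping variable on the right
res <- kruskal.test(base_total ~ generation, data = sim)
print(res$parameter)  # degrees of freedom = number of groups - 1
print(res$p.value)
```

Checking that the degrees of freedom equal the number of groups minus one is a quick way to confirm the formula is written the right way round.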

Hypothesis 3:

Is Bigger the better, always?

To check whether bigger pokemons always perform better, or whether smaller pokemons are equally good, we’ll divide pokemons into three height categories:

Small Pokemon: up to 0.7 meters of height
Medium Pokemon: above 0.7 and up to 1.4 meters of height
Big Pokemon: all pokemons taller than 1.4 meters

Subsetting pokemons based on their size

Smallpok<- na.omit(pokemon[pokemon$height_m <= 0.7,])
Medpok<- na.omit(pokemon[pokemon$height_m > 0.7 & pokemon$height_m <= 1.4,])
Bigpok<- na.omit(pokemon[pokemon$height_m > 1.4,])
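As an alternative sketch, cut() can assign the size category in one step using the same limits as the subsetting code (0.7 and 1.4 meters). The heights below are made up for illustration.

```r
# Made-up heights in meters, including values exactly on the limits
heights <- c(0.3, 0.7, 0.71, 1.4, 1.5, 14.5)

# Intervals are (0, 0.7], (0.7, 1.4] and (1.4, Inf]
size <- cut(heights,
            breaks = c(0, 0.7, 1.4, Inf),
            labels = c("Small", "Medium", "Big"))
print(table(size))
```

A single size column like this can be used directly in a formula, for example base_total ~ size.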

Hypothesis 3: Boxplot

boxplot(Smallpok$base_total, Medpok$base_total, Bigpok$base_total, names = c("Small pokemon", "Medium Pokemon", "Big Pokemon"), col = "#8BC3B6", xlab="Pokemon Size Category", ylab="Base Total", main="Pokemon Size Vs Base Total")

stripchart(list(Smallpok$base_total, Medpok$base_total, Bigpok$base_total), vertical = TRUE, method = "jitter", add = TRUE, pch = 20, col = "red") # the groups must be passed as one list

As we can see above, the base total of big pokemons seems higher than that of the other two categories, so before we proceed we first test whether the data are normally distributed.

Hypothesis3: Normality test by Histogram

par(mfrow=c(2,2))
hist(Smallpok$base_total, main = "Small Pokemon", col = "#8BC3B6", xlab = "Base Total")
hist(Medpok$base_total, main = "Medium Pokemon", col = "#8BC3B6", xlab = "Base Total")
hist(Bigpok$base_total, main = "Large Pokemon", col = "#8BC3B6", xlab = "Base Total")
hist(rnorm(5000,mean=500,sd=50), main = "Normal Control",  col="#A6B91A", xlab = "Normal function")

Hypothesis3: Normality test by qqplot

par(mfrow=c(2,2))
qqnorm(Smallpok$base_total, col = "#8BC3B6", main = "Small pokemon" )
qqline(Smallpok$base_total)

qqnorm(Medpok$base_total, col="#8BC3B6", main = "Medium Pokemon")
qqline(Medpok$base_total)

qqnorm(Bigpok$base_total, col = "#8BC3B6", main = "Big pokemon")
qqline(Bigpok$base_total)

ctrl <- rnorm(5000,mean=500,sd=50) # draw the control sample once so the points and the line use the same data
qqnorm(ctrl,  col="#A6B91A", main = "Normal Function")
qqline(ctrl, col= "red")

Hypothesis3: Normality test by Shapiro test

Null Hypothesis: Dataset is Normally Distributed

shapiro.test(Smallpok$base_total)

## 
##  Shapiro-Wilk normality test
## 
## data:  Smallpok$base_total
## W = 0.94951, p-value = 4.46e-08

shapiro.test(Medpok$base_total)

## 
##  Shapiro-Wilk normality test
## 
## data:  Medpok$base_total
## W = 0.97917, p-value = 0.0008554

shapiro.test(Bigpok$base_total)

## 
##  Shapiro-Wilk normality test
## 
## data:  Bigpok$base_total
## W = 0.95806, p-value = 0.0001116

The P values are all below 0.05, so we reject the Null Hypothesis: our data is not normally distributed.

Hypothesis 3: Testing

Since the data are not normally distributed, we use a non-parametric test. The boxplot above suggests that the base total of Big pokemons is greater than that of Medium and Small pokemons, so we test this in one direction.

Wilcoxon test

Null Hypothesis: Base total of the bigger group is not greater than that of the smaller group.
Alternate Hypothesis: Base total of the bigger group is greater.

wilcox.test(Bigpok$base_total, Medpok$base_total, alternative = "greater")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Bigpok$base_total and Medpok$base_total
## W = 32841, p-value < 2.2e-16
## alternative hypothesis: true location shift is greater than 0

The P value is very small, so we can reject the Null Hypothesis. Next we compare Medium and Small pokemons:

wilcox.test(Medpok$base_total, Smallpok$base_total, alternative = "greater")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Medpok$base_total and Smallpok$base_total
## W = 58828, p-value < 2.2e-16
## alternative hypothesis: true location shift is greater than 0

Here too, the P value is very small, so we can reject the Null Hypothesis.

Conclusion

By the above tests we can say that Big Pokemons are better than Medium Pokemons, and Medium Pokemons are better than Small ones.
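When several pairs of groups are compared, the P values can be adjusted for multiple testing. The sketch below uses pairwise.wilcox.test() with the Holm correction on simulated data (made-up numbers, not the Pokemon dataset). This is an alternative to the separate tests above, not the method used in this assignment, and it runs two-sided tests by default.

```r
set.seed(5)

# Simulated base totals for three size groups
sim <- data.frame(
  size = factor(rep(c("Small", "Medium", "Big"), each = 80),
                levels = c("Small", "Medium", "Big")),
  base_total = c(rnorm(80, 330, 80), rnorm(80, 430, 90), rnorm(80, 520, 90))
)

# All pairwise Wilcoxon tests, with Holm-adjusted P values
res <- pairwise.wilcox.test(sim$base_total, sim$size,
                            p.adjust.method = "holm")
print(res$p.value)
```

The result is a small matrix with one adjusted P value for each pair of groups.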

Testing correlations

We can also check whether the individual qualities of a Pokemon increase with height or not.

Does great attack come with good height?

par(mfrow=c(2,2))

plot(Smallpok$height_m, Smallpok$base_total,col="#8BC3B6", ylim = c(0, max(pokemon$base_total)), xlim = c(0, 5), xlab = "Height", ylab = "Base Total", main = "Height vs Base Total")
points(Medpok$height_m, Medpok$base_total, col="#A6B91A")
points(Bigpok$height_m, Bigpok$base_total, col="#705746")
legend(3.5, 200, legend=c("Small pokemons", "Medium pokemon", "Big pokemon"),
       col=c("#8BC3B6", "#A6B91A","#705746"), lty=1:2, cex=0.8)

plot(Smallpok$height_m, Smallpok$attack,col="#8BC3B6", ylim = c(0, max(pokemon$attack)), xlim = c(0, 5), xlab = "Height", ylab = "Attack", main = "Height vs Attack")
points(Medpok$height_m, Medpok$attack, col="#A6B91A")
points(Bigpok$height_m, Bigpok$attack, col="#705746")
legend(3.5, 40, legend=c("Small pokemons", "Medium pokemon", "Big pokemon"),
       col=c("#8BC3B6", "#A6B91A","#705746"), lty=1:2, cex=0.8)

plot(Smallpok$height_m, Smallpok$defense ,col="#8BC3B6", ylim = c(0, max(pokemon$defense)), xlim = c(0, 5), xlab = "Height", ylab = "Defence", main = "Height vs Defence")
points(Medpok$height_m, Medpok$defense, col="#A6B91A")
points(Bigpok$height_m, Bigpok$defense, col="#705746")
legend(3.5, 40, legend=c("Small pokemons", "Medium pokemon", "Big pokemon"),
       col=c("#8BC3B6", "#A6B91A","#705746"), lty=1:2, cex=0.8)


plot(Smallpok$height_m, Smallpok$weight_kg,col="#8BC3B6", ylim = c(0, max(pokemon$weight_kg, na.rm = TRUE)), xlim = c(0, 5), xlab = "Height", ylab = "Weight in Kg", main = "Height vs Weight")
points(Medpok$height_m, Medpok$weight_kg, col="#A6B91A")
points(Bigpok$height_m, Bigpok$weight_kg, col="#705746")
legend(3.5, 40, legend=c("Small pokemons", "Medium pokemon", "Big pokemon"),
       col=c("#8BC3B6", "#A6B91A","#705746"), lty=1:2, cex=0.8)

From the above plots it seems that Height is positively related to these properties, so we’ll do a correlation test.

Correlation test

cor.test(pokemon$height_m, pokemon$base_total)

## 
##  Pearson's product-moment correlation
## 
## data:  pokemon$height_m and pokemon$base_total
## t = 17.677, df = 779, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4830419 0.5833202
## sample estimates:
##       cor 
## 0.5350631

cor.test(pokemon$height_m, pokemon$attack)

## 
##  Pearson's product-moment correlation
## 
## data:  pokemon$height_m and pokemon$attack
## t = 13.035, df = 779, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3638075 0.4790907
## sample estimates:
##       cor 
## 0.4231602

cor.test(pokemon$height_m, pokemon$defense)

## 
##  Pearson's product-moment correlation
## 
## data:  pokemon$height_m and pokemon$defense
## t = 10.837, df = 779, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2993865 0.4213907
## sample estimates:
##       cor 
## 0.3619375

cor.test(pokemon$height_m, pokemon$weight_kg)

## 
##  Pearson's product-moment correlation
## 
## data:  pokemon$height_m and pokemon$weight_kg
## t = 22.438, df = 779, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5819789 0.6673701
## sample estimates:
##       cor 
## 0.6265511

Conclusion

The correlation values are positive for all four pairs, so we can say that Base total, Attack, Defence and Weight tend to increase with height.
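Pearson correlation assumes roughly normal data, and the normality tests above rejected normality for base total. As a hedged alternative (not the method used above), a rank-based Spearman correlation can be run with the same function. The data below are simulated and the numbers are made up.

```r
set.seed(6)

# Simulated heights and a base total that increases with height plus noise
height     <- runif(200, min = 0.1, max = 5)
base_total <- 300 + 60 * height + rnorm(200, sd = 60)

# Spearman correlation works on ranks instead of raw values
res <- cor.test(height, base_total, method = "spearman", exact = FALSE)
print(res$estimate)
print(res$p.value)
```

A positive rho with a small P value points to the same kind of increasing relationship as the Pearson results.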

Hypothesis 4

Are Bigger and Heavier Pokemons harder to capture?

As size and weight increase, does it get tougher to capture a pokemon?

Boxplot

boxplot(as.numeric(Smallpok$capture_rate), as.numeric(Medpok$capture_rate), as.numeric(Bigpok$capture_rate),col = "#8BC3B6", main="Size vs Capture Rate", names = c("Small Pokemon", "Medium Pokemon", "Big Pokemon"), xlab="Pokemon Size", ylab="Capture Rate")

The boxplot above suggests that Smaller pokemons are easier to capture, so to check our hypothesis we’ll first do normality tests.
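The code above wraps capture_rate in as.numeric(), which suggests the column was not read as a number. The sketch below shows, on made-up values, why this conversion needs care: on a factor, as.numeric() returns level codes instead of the values, and a text entry that is not a plain number becomes NA. The vector raw is hypothetical and is not taken from pokemon.csv.

```r
# Hypothetical capture-rate values stored as text; the last entry is not a plain number
raw <- c("45", "255", "3", "30 (example)")
as_factor <- factor(raw)

# On a factor, as.numeric() gives the internal level codes, not the values
codes <- as.numeric(as_factor)

# Converting to character first gives the real values (NA where parsing fails)
values <- suppressWarnings(as.numeric(as.character(as_factor)))

print(codes)
print(values)
```

Running str() on the column first shows whether it is a factor, a character vector or already numeric.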

Hypothesis4: Normality test by Histogram

par(mfrow=c(2,2))
hist(as.numeric(Smallpok$capture_rate), main = "Small Pokemon", col = "#8BC3B6", xlab = "Capture Rate")
hist(as.numeric(Medpok$capture_rate), main = "Medium Pokemon", col = "#8BC3B6", xlab = "Capture Rate")
hist(as.numeric(Bigpok$capture_rate), main = "Large Pokemon", col = "#8BC3B6", xlab = "Capture Rate")
hist(rnorm(5000,mean=500,sd=50), main = "Normal Control",  col="#A6B91A", xlab = "Normal function")

Hypothesis4: Normality test by qqplot

par(mfrow=c(2,2))
qqnorm(as.numeric(Smallpok$capture_rate), col = "#8BC3B6", main = "Small pokemon" )
qqline(as.numeric(Smallpok$capture_rate))

qqnorm(as.numeric(Medpok$capture_rate), col="#8BC3B6", main = "Medium Pokemon")
qqline(as.numeric(Medpok$capture_rate))

qqnorm(as.numeric(Bigpok$capture_rate), col = "#8BC3B6", main = "Big pokemon")
qqline(as.numeric(Bigpok$capture_rate))

ctrl <- rnorm(5000,mean=500,sd=50) # draw the control sample once so the points and the line use the same data
qqnorm(ctrl,  col="#A6B91A", main = "Normal Function")
qqline(ctrl, col= "red")

Hypothesis4: Normality test by Shapiro test

Null Hypothesis: Dataset is Normally Distributed

shapiro.test(as.numeric(Smallpok$capture_rate))

## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(Smallpok$capture_rate)
## W = 0.88356, p-value = 1.445e-13

shapiro.test(as.numeric(Medpok$capture_rate))

## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(Medpok$capture_rate)
## W = 0.77337, p-value < 2.2e-16

shapiro.test(as.numeric(Bigpok$capture_rate))

## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(Bigpok$capture_rate)
## W = 0.69063, p-value < 2.2e-16

The P values are all below 0.05, so we reject the Null Hypothesis: our data is not normally distributed.

Hypothesis 4: Testing

Since the data are not normally distributed, we use a non-parametric test. The boxplot above suggests that the capture rate of Small pokemons is greater than that of Medium and Big pokemons, so we test this in one direction.

Wilcoxon test

Null Hypothesis: Capture rate of the smaller group is not greater than that of the bigger group.
Alternate Hypothesis: Capture rate of the smaller group is greater.

wilcox.test(as.numeric(Smallpok$capture_rate), as.numeric(Medpok$capture_rate), alternative = "greater")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  as.numeric(Smallpok$capture_rate) and as.numeric(Medpok$capture_rate)
## W = 51307, p-value < 2.2e-16
## alternative hypothesis: true location shift is greater than 0

The P value is very small, so we can reject the Null Hypothesis. Next we compare Medium and Big pokemons:

wilcox.test(as.numeric(Medpok$capture_rate), as.numeric(Bigpok$capture_rate), alternative = "greater")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  as.numeric(Medpok$capture_rate) and as.numeric(Bigpok$capture_rate)
## W = 26372, p-value = 7.849e-09
## alternative hypothesis: true location shift is greater than 0

Here too, the P value is very small, so we can reject the Null Hypothesis.

Conclusion

By the above tests we can say that Small Pokemons are easier to capture than Medium Pokemons, and Medium Pokemons are easier to capture than Big ones.

Hypothesis 5

Are Fighting pokemons better in attack and defence?

Extracting fighter pokemon from pokemon dataset

fighterpok<- pokemon[pokemon$type1=="fighting" | pokemon$type2=="fighting",]
nonfightpok<- pokemon[pokemon$type1!="fighting" & pokemon$type2!="fighting",] # & keeps only pokemons with no fighting type; | would keep fighting types too
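The two subsets should split the data into two parts with no overlap. The sketch below checks this on a toy data frame (made-up rows, not taken from pokemon.csv) by building the Fighting condition once and negating it.

```r
# Toy data (made-up rows): an empty type2 means the Pokemon has only one type
toy <- data.frame(
  name  = c("A", "B", "C", "D"),
  type1 = c("fighting", "water", "fire", "grass"),
  type2 = c("", "fighting", "", "poison"),
  stringsAsFactors = FALSE
)

# TRUE when either type is fighting
is_fighting <- toy$type1 == "fighting" | toy$type2 == "fighting"

fighters     <- toy[is_fighting, ]
non_fighters <- toy[!is_fighting, ]  # same as type1 != "fighting" & type2 != "fighting"

print(fighters$name)
print(non_fighters$name)
```

Negating one logical vector avoids having to rewrite the condition by hand, and the row counts of the two subsets add up to the full data.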

Hypothesis5: Boxplot

par(mfrow=c(1,2))

boxplot(fighterpok$base_total, nonfightpok$base_total, main="Base Total", names = c("Fighting","Non Fighting"), col = c("#8BC3B6","#8BC3B6"), xlab = "Fighting Type", ylab = "Base Total")

boxplot(fighterpok$attack, nonfightpok$attack, main="Attack", names = c("Fighting","Non Fighting"), col = c("#8BC3B6","#8BC3B6"), xlab = "Fighting Type", ylab = "Attack")

From the above boxplots it seems that the base total and Attack of Fighting pokemons are higher than those of the non-fighting ones.

To proceed further we first have to check normality with the following methods.

Hypothesis5: Normality test by Histogram

par(mfrow=c(2,3))

hist(fighterpok$base_total, main = "Fighter Pokemons base total", xlab = "Base Total of Fighting pokemons", col= "#8BC3B6")

hist(nonfightpok$base_total, main = "Non Fighter Pokemons base total", xlab = "Base Total of Non Fighting pokemons", col= "#8BC3B6")

hist(fighterpok$attack, main = "Fighting Pokemons attack", xlab = "Attack of Fighting pokemons", col= "#8BC3B6")

hist(nonfightpok$attack, main = "Non Fighting Pokemons attack", xlab = "Attack of Non Fighting pokemons", col= "#8BC3B6")

hist(rnorm(5000,mean=500,sd=50), main = "Normal control", xlab = "Normal function", col="#A6B91A")

From the above histograms, the base total and Attack data for Fighting and Non Fighting pokemons do not seem to be normally distributed.

Hypothesis5: Normality test by qqplot

par(mfrow=c(2,3))
qqnorm(fighterpok$base_total, col = "#8BC3B6", main = "Fighting Pokemon Base total")
qqline(fighterpok$base_total)

qqnorm(nonfightpok$base_total, col="#8BC3B6", main = "Non Fighting Pokemon Base total")
qqline(nonfightpok$base_total)

qqnorm(fighterpok$attack, col="#8BC3B6", main = "Fighting Pokemon Attack")
qqline(fighterpok$attack)

qqnorm(nonfightpok$attack, col="#8BC3B6", main = "Non Fighting Pokemon Attack")
qqline(nonfightpok$attack)

ctrl <- rnorm(5000,mean=500,sd=50) # draw the control sample once so the points and the line use the same data
qqnorm(ctrl, col="#A6B91A", main = "Normal Control")
qqline(ctrl)

From the above qqplots, the base total and Attack data for Fighting and Non fighting pokemons do not seem to be normally distributed.

Hypothesis5: Normality test by Shapiro test

Null Hypothesis: Dataset is Normally Distributed

shapiro.test(fighterpok$base_total)

## 
##  Shapiro-Wilk normality test
## 
## data:  fighterpok$base_total
## W = 0.96005, p-value = 0.07379

The P value is greater than 0.05, so it is not significant and we cannot reject the Null Hypothesis: the base total of fighting pokemons is consistent with a normal distribution.

shapiro.test(nonfightpok$base_total)

## 
##  Shapiro-Wilk normality test
## 
## data:  nonfightpok$base_total
## W = 0.98024, p-value = 6.195e-09

shapiro.test(nonfightpok$attack)

## 
##  Shapiro-Wilk normality test
## 
## data:  nonfightpok$attack
## W = 0.97948, p-value = 3.581e-09

The P values are below 0.05, so we reject the Null Hypothesis: base total and attack of Non fighting Pokemons aren’t normally distributed.

Hypothesis 5: Testing

The non fighting data are not normally distributed, so we use a non-parametric test. The boxplots above suggest that the base total and attack of Fighting pokemons are greater than those of Non Fighting pokemons, so we test this in one direction.

Wilcoxon test

Null Hypothesis: Base total and attack of Fighting pokemons are not greater than those of Non Fighting pokemons.
Alternate Hypothesis: Base total and attack of Fighting pokemons are greater.

wilcox.test(fighterpok$base_total, nonfightpok$base_total, alternative = "greater")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  fighterpok$base_total and nonfightpok$base_total
## W = 24521, p-value = 0.02912
## alternative hypothesis: true location shift is greater than 0

wilcox.test(fighterpok$attack, nonfightpok$attack, alternative = "greater")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  fighterpok$attack and nonfightpok$attack
## W = 30916, p-value = 1.251e-08
## alternative hypothesis: true location shift is greater than 0

As we can see, both P values are below 0.05, so we can reject the Null Hypothesis.

Conclusion

By the above tests we can say that Fighting pokemons are better in Attack and in Base total than non fighting pokemons.

Govind Prakash Assignment 2 (Pokemon Dataset)

Govind Prakash

20/02/2021

————————————- The End ———————————————-

Govind Prakash

17096
