I will investigate the following hypotheses:
Hypothesis Q1: Are Legendary Pokemon better than non-Legendary ones?
Hypothesis Q2: Did Pokemon get better with each generation?
Hypothesis Q3: Is bigger always better? Does size matter for Pokemon?
Hypothesis Q4: Are bigger and heavier Pokemon harder to capture?
Hypothesis Q5: Are Fighting Pokemon better in attack and defence?
getwd()
## [1] "X:/1.Study/4th year semester 2/Biostat/Assignment 2"
setwd("X:/1.Study/4th year semester 2/Biostat/Assignment 2")
getwd()
## [1] "X:/1.Study/4th year semester 2/Biostat/Assignment 2"
pokemon<- read.csv("pokemon.csv", header = TRUE)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(GGally)
library(tidyr)
library(ggthemes)
library(gridExtra)
library(cowplot)
dim(pokemon)
## [1] 801 41
So there are 801 rows and 41 columns.
str(pokemon)
The dataset contains many numeric variables. Correlation between two variables can only be computed when both are numeric, so we select the numeric variables only.
poknum<- select_if(pokemon, is.numeric)
ggpairs(pokemon, columns = c('attack', 'defense', 'hp', 'sp_attack', 'sp_defense', 'speed'), col="red") +
theme_bw() +
labs(title = 'Correlation Matrix of Pokemon Stats')
type1num<- pokemon %>%
group_by(type1) %>%
summarise(number = n()) %>%
ggplot(aes(x = reorder(type1, number), y = number , fill = type1)) +
geom_bar(stat = 'identity') +
xlab(label = "Type of Pokemon") +
ylab(label = "Number of Pokemon") +
ggtitle(label = "Number of Pokemon Type 1") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(legend.position="none") +
coord_flip() +
geom_text(aes(label = number), hjust = -1.0)
## `summarise()` ungrouping output (override with `.groups` argument)
type2num<- pokemon %>%
group_by(type2) %>%
summarise(number = n()) %>%
ggplot(aes(x = reorder(type2, number), y = number , fill = type2)) +
geom_bar(stat = 'identity') +
xlab(label = "Type of Pokemon") +
ylab(label = "Number of Pokemon") +
ggtitle(label = "Number of Pokemon Type 2") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(legend.position="none") +
coord_flip() +
geom_text(aes(label = number), hjust = -1.0)
## `summarise()` ungrouping output (override with `.groups` argument)
plot_grid(type1num, type2num, labels=c("A", "B"), ncol = 2, nrow = 1)
As the name suggests, Legendary Pokemon should be more powerful than non-Legendary ones, so this is the first hypothesis we check.
is_legendary<- pokemon[pokemon$is_legendary==1, ]
not_legendary<- pokemon[pokemon$is_legendary==0, ]
legendary <- data.frame(
Type = c("Legendary", "Not Legendary"),
Numbers = c(nrow(is_legendary), nrow(not_legendary))
)
head(legendary)
## Type Numbers
## 1 Legendary 70
## 2 Not Legendary 731
ggplot(legendary, aes(x="", y= Numbers, fill= Type))+
geom_bar(width = 1, stat = "identity")+
coord_polar("y", start=0)+
scale_fill_brewer(palette="Blues")+
theme_minimal()
The pie chart above shows that non-Legendary Pokemon make up the large majority of the Pokemon population (731 of 801).
If Legendary Pokemon are really better, their base total should be higher than that of non-Legendary ones. The base total is the sum of the six base stats (HP, attack, defense, special attack, special defense and speed), so it is a good single measure of the overall power of a Pokemon.
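As a quick check of how the base total relates to the individual stats, the sketch below recomputes it from the six stat columns used in the correlation matrix above. The single-row `toy` data frame and its values are made up for illustration, and the assumption that `base_total` is the plain sum of these six columns should be verified against the real data.

```r
# Made-up single-row example; the column names follow the stats used in ggpairs() above.
toy <- data.frame(hp = 45, attack = 49, defense = 49,
                  sp_attack = 65, sp_defense = 65, speed = 45)

stat_cols <- c("hp", "attack", "defense", "sp_attack", "sp_defense", "speed")

# Assumption: base total is the sum of the six base stats.
toy$base_total <- rowSums(toy[, stat_cols])
toy$base_total
```

On the real data, the same `rowSums()` call could be compared against `pokemon$base_total` with `all.equal()`.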
boxplot(pokemon$base_total~pokemon$is_legendary,main="Base total of Legendary vs Non Legendary Pokemons", names = c("Not Legendary","Legendary"), col = c("#A6B91A", "#A6B91A"), xlab = "Is_legendary", ylab = "Base Total")
stripchart(pokemon$base_total~pokemon$is_legendary, vertical = TRUE, method = "jitter", add = TRUE, pch = 20, col = c("#077eff", "#077eff"))
The boxplot above suggests that the base total of Legendary Pokemon is higher than that of non-Legendary ones.
Before choosing a test, we first check whether the data are normally distributed, using the following methods.
par(mfrow=c(1,3))
hist(is_legendary$base_total, main = "Legendary Pokemons", xlab = "Base Total of Legendary pokemons", col= "#8BC3B6")
hist(not_legendary$base_total, main = "Non Legendary Pokemons", xlab = "Base Total of Non Legendary pokemons", col= "#8BC3B6")
hist(rnorm(5000,mean=500,sd=50), main = "Normal control", xlab = "Normal function", col="#A6B91A")
Judging by the histograms above, the base total of neither Legendary nor non-Legendary Pokemon appears to be normally distributed.
par(mfrow=c(1,3))
qqnorm(is_legendary$base_total, col = "#8BC3B6", main = "Legendary Pokemon")
qqline(is_legendary$base_total)
qqnorm(not_legendary$base_total, col="#8BC3B6", main = "Non Legendary")
qqline(not_legendary$base_total)
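# Draw the control sample once so that qqnorm() and qqline() use the same data
normal_control <- rnorm(5000,mean=500,sd=50)
qqnorm(normal_control, col="#A6B91A", main = "Normal Control")
qqline(normal_control)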
Judging by the Q-Q plots above, the base total of neither Legendary nor non-Legendary Pokemon appears to be normally distributed.
shapiro.test(is_legendary$base_total)
##
## Shapiro-Wilk normality test
##
## data: is_legendary$base_total
## W = 0.76868, p-value = 3.834e-09
The P value is far below 0.05, so we reject the null hypothesis of normality: the base total of Legendary Pokemon is not normally distributed.
shapiro.test(not_legendary$base_total)
##
## Shapiro-Wilk normality test
##
## data: not_legendary$base_total
## W = 0.97442, p-value = 5.301e-10
The P value is far below 0.05, so we reject the null hypothesis of normality: the base total of non-Legendary Pokemon is not normally distributed.
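The decision rule for the Shapiro-Wilk test can be written down explicitly. The sketch below is only an illustration: `normality_decision` is a hypothetical helper (not part of any package), and the 0.05 threshold is the conventional significance level assumed throughout this report.

```r
# Hypothetical helper: turn a Shapiro-Wilk P value into a plain-language decision.
# The null hypothesis of shapiro.test() is that the sample comes from a normal distribution.
normality_decision <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject normality" else "do not reject normality"
}

# P values copied from Shapiro-Wilk outputs in this report
normality_decision(3.834e-09)   # Legendary base total
normality_decision(0.07379)     # a P value above 0.05
```

With real data the P value can be taken directly from the test object, for example `normality_decision(shapiro.test(x)$p.value)`.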
Since neither group is normally distributed, a t-test is not appropriate, so we use the non-parametric Wilcoxon rank-sum test. The boxplot above suggests that the base total of Legendary Pokemon is higher, so we use a one-sided test.
##### Wilcoxon test
* Null Hypothesis: The base total of Legendary Pokemon is not greater than that of non-Legendary Pokemon.
* Alternate Hypothesis: The base total of Legendary Pokemon is greater than that of non-Legendary Pokemon.
wilcox.test(is_legendary$base_total, not_legendary$base_total, alternative = "greater")
##
## Wilcoxon rank sum test with continuity correction
##
## data: is_legendary$base_total and not_legendary$base_total
## W = 48475, p-value < 2.2e-16
## alternative hypothesis: true location shift is greater than 0
The P value is very small (< 2.2e-16), so we reject the null hypothesis.
From the above tests we can say that Legendary Pokemon are better than non-Legendary Pokemon.
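A P value alone does not say how large the difference is. As a rough effect size, the `W` statistic reported by `wilcox.test()` for two independent samples can be divided by the number of Legendary/non-Legendary pairs, which estimates the probability that a randomly chosen Legendary Pokemon has a higher base total than a randomly chosen non-Legendary one (ties count as half). The sketch below uses only the numbers printed above; the variable names are illustrative.

```r
W  <- 48475   # statistic reported by wilcox.test() above
n1 <- 70      # number of Legendary Pokemon
n2 <- 731     # number of non-Legendary Pokemon

# Probability of superiority: share of (Legendary, non-Legendary) pairs
# in which the Legendary Pokemon has the higher base total
prob_superiority <- W / (n1 * n2)
round(prob_superiority, 3)
```

A value close to 1 indicates that the two groups barely overlap, while 0.5 would indicate no tendency either way.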
gen1<- poknum[poknum$generation== 1,]
gen2<- poknum[poknum$generation== 2,]
gen3<- poknum[poknum$generation== 3,]
gen4<- poknum[poknum$generation== 4,]
gen5<- poknum[poknum$generation== 5,]
gen6<- poknum[poknum$generation== 6,]
gen7<- poknum[poknum$generation== 7,]
Generations <- data.frame(
Generation = c("Generation 1", "Generation 2", "Generation 3","Generation 4", "Generation 5", "Generation 6", "Generation 7"),
Numbers = c(nrow(gen1), nrow(gen2), nrow(gen3), nrow(gen4), nrow(gen5), nrow(gen6), nrow(gen7))
)
ggplot(Generations, aes(x="", y= Numbers, fill= Generation))+
geom_bar(width = 1, stat = "identity")+
coord_polar("y", start=0)+
scale_fill_brewer(palette="Blues")+
theme_minimal()
As we have seen above, Legendary Pokemon are better than non-Legendary ones, so if Pokemon are getting better with each generation we might also expect to see more Legendary Pokemon in later generations.
pokemon_edit <- pokemon # Just so we dont mess up dataset by mistake
pokemon_edit$is_legendary<-recode(pokemon_edit$is_legendary, "0" = "No", "1" ="Yes")
ggplot(pokemon_edit, aes(x = generation, fill = is_legendary)) +
geom_bar() + ggtitle("Pokemon Generation wise Distribution frequency")
par(mfrow=c(2,2))
boxplot(pokemon$base_total~pokemon$generation, main="Base total vs Generation", xlab = "Generation", ylab = "Basetotal", col= "#8BC3B6")
boxplot(pokemon$attack~pokemon$generation, main="Attack vs Generation", xlab = "Generation", ylab = "Attack", col= "#8BC3B6")
boxplot(pokemon$defense~pokemon$generation, main="Defence vs Generation", xlab = "Generation", ylab = "Defence", col= "#8BC3B6")
boxplot(pokemon$weight_kg~pokemon$generation, main="Weight vs Generation", xlab = "Generation", ylab = "Weight", col= "#8BC3B6")
The boxplots do not show an obvious difference between generations, so to test our hypothesis formally we take base total as the measuring criterion.
boxplot(pokemon$base_total~pokemon$generation, main="Base total vs Generation", xlab = "Generation", ylab = "Basetotal", col= "#8BC3B6")
stripchart(pokemon$base_total~pokemon$generation, vertical = TRUE, method = "jitter", add = TRUE, pch = 20, col = c("#A6B91A","#705746","#6F35FC","#F7D02C","#D685AD","#C22E28",
"#EE8130"))
Before testing the hypothesis, we first check the base total of each generation for normality.
par(mfrow=c(3,3))
hist(gen1$base_total, main = "Generation 1", xlab = "Base Total", col="#8BC3B6")
hist(gen2$base_total, main = "Generation 2", xlab = "Base Total", col="#8BC3B6")
hist(gen3$base_total, main = "Generation 3", xlab = "Base Total", col="#8BC3B6")
hist(gen4$base_total, main = "Generation 4", xlab = "Base Total", col="#8BC3B6")
hist(gen5$base_total, main = "Generation 5", xlab = "Base Total", col="#8BC3B6")
hist(gen6$base_total, main = "Generation 6", xlab = "Base Total", col="#8BC3B6")
hist(gen7$base_total, main = "Generation 7", xlab = "Base Total", col="#8BC3B6")
hist(rnorm(5000,mean=500,sd=50), main = "Normal control", xlab = "Normal function", col="#A6B91A")
par(mfrow=c(3,3))
qqnorm(gen1$base_total, col = "#8BC3B6", main = "Generation 1" )
qqline(gen1$base_total)
qqnorm(gen2$base_total, col="#8BC3B6", main = "Generation 2")
qqline(gen2$base_total)
qqnorm(gen3$base_total, col = "#8BC3B6", main = "Generation 3")
qqline(gen3$base_total)
qqnorm(gen4$base_total, col="#8BC3B6", main = "Generation 4")
qqline(gen4$base_total)
qqnorm(gen5$base_total, col = "#8BC3B6", main = "Generation 5")
qqline(gen5$base_total)
qqnorm(gen6$base_total, col="#8BC3B6", main = "Generation 6")
qqline(gen6$base_total)
qqnorm(gen7$base_total, col = "#8BC3B6", main = "Generation 7")
qqline(gen7$base_total)
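# Draw the control sample once so that qqnorm() and qqline() use the same data
normal_control <- rnorm(5000,mean=500,sd=50)
qqnorm(normal_control, col="#A6B91A", main = "Normal Control")
qqline(normal_control)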
shapiro.test(gen1$base_total)
##
## Shapiro-Wilk normality test
##
## data: gen1$base_total
## W = 0.97494, p-value = 0.007344
shapiro.test(gen2$base_total)
##
## Shapiro-Wilk normality test
##
## data: gen2$base_total
## W = 0.98092, p-value = 0.1566
shapiro.test(gen3$base_total)
##
## Shapiro-Wilk normality test
##
## data: gen3$base_total
## W = 0.96794, p-value = 0.002843
shapiro.test(gen4$base_total)
##
## Shapiro-Wilk normality test
##
## data: gen4$base_total
## W = 0.96918, p-value = 0.01369
shapiro.test(gen5$base_total)
##
## Shapiro-Wilk normality test
##
## data: gen5$base_total
## W = 0.94298, p-value = 6.067e-06
shapiro.test(gen6$base_total)
##
## Shapiro-Wilk normality test
##
## data: gen6$base_total
## W = 0.96647, p-value = 0.05173
shapiro.test(gen7$base_total)
##
## Shapiro-Wilk normality test
##
## data: gen7$base_total
## W = 0.95538, p-value = 0.007093
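The Shapiro-Wilk tests reject normality for generations 1, 3, 4, 5 and 7 (P < 0.05) but not for generations 2 and 6 (P = 0.1566 and P = 0.05173). Since most groups are not normally distributed, we use the non-parametric Kruskal-Wallis test to check whether the base total differs between the seven generations.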
kruskal.test(pokemon$generation~pokemon$base_total)
##
## Kruskal-Wallis rank sum test
##
## data: pokemon$generation by pokemon$base_total
## Kruskal-Wallis chi-squared = 274.83, df = 202, p-value = 0.0004941
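Note that the call above has the response and the grouping variable reversed: it groups `generation` by `base_total`, which is why the output reports df = 202 instead of df = 6 (seven generations minus one). This result therefore does not test our hypothesis; the intended call is `kruskal.test(base_total ~ generation, data = pokemon)`. Also, a small P value would mean that we reject the null hypothesis of equal distributions, not that there is no difference.
The boxplots above show no clear upward trend in base total across generations, which suggests that Pokemon are not getting better with each generation, but this should be confirmed by rerunning the test with the corrected formula.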
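The direction of the formula matters for `kruskal.test()`: the response goes on the left and the grouping variable on the right. The sketch below uses a synthetic data frame (`toy`, with made-up base totals for seven generations), so its P value says nothing about the real Pokemon data; it only shows that a correctly specified call reports degrees of freedom equal to the number of groups minus one.

```r
set.seed(42)

# Synthetic stand-in for the real data: 7 generations, 20 made-up base totals each
toy <- data.frame(
  generation = rep(1:7, each = 20),
  base_total = round(rnorm(140, mean = 420, sd = 100))
)

# Response on the left, grouping variable on the right
kw <- kruskal.test(base_total ~ generation, data = toy)

kw$parameter   # degrees of freedom: number of generations minus one
kw$p.value
```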
To check whether bigger Pokemon always perform better, or whether smaller Pokemon are equally good, we divide the Pokemon into three height categories:
Small Pokemon: up to 0.7 meters in height
Medium Pokemon: above 0.7 and up to 1.4 meters in height
Big Pokemon: taller than 1.4 meters
Smallpok<- na.omit(pokemon[pokemon$height_m <= 0.7,])
Medpok<- na.omit(pokemon[pokemon$height_m > 0.7 & pokemon$height_m <= 1.4,])
Bigpok<- na.omit(pokemon[pokemon$height_m > 1.4,])
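The same three groups can also be expressed as a single factor with `cut()`, which keeps all rows in one data frame and makes formula-based plots and tests easier. This is a minimal sketch on made-up heights; the break points 0.7 and 1.4 are the ones used in the subsetting code above, and the labels are illustrative.

```r
heights <- c(0.3, 0.7, 0.71, 1.4, 1.41, 3.2)   # made-up heights in meters

# Intervals are closed on the right: (0, 0.7], (0.7, 1.4], (1.4, Inf)
size <- cut(heights,
            breaks = c(0, 0.7, 1.4, Inf),
            labels = c("Small", "Medium", "Big"))

data.frame(heights, size)
```

On the real data this could be stored as a new column, for example `pokemon$size <- cut(pokemon$height_m, ...)`. A height of exactly 0 would fall outside the first interval and become `NA`.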
boxplot(Smallpok$base_total, Medpok$base_total, Bigpok$base_total, names = c("Small pokemon", "Medium Pokemon", "Big Pokemon"), col = "#8BC3B6", xlab="Pokemon Size Category", ylab="Base Total", main="Pokemon Size Vs Base Total")
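# stripchart() takes the groups as one list; extra positional vectors would be matched to other arguments
stripchart(list(Smallpok$base_total, Medpok$base_total, Bigpok$base_total), vertical = TRUE, method = "jitter", add = TRUE, pch = 20, col = "red")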
As we can see above, the base total of big Pokemon appears higher than that of the other two categories. Before testing this, we first check each group for normality.
par(mfrow=c(2,2))
hist(Smallpok$base_total, main = "Small Pokemon", col = "#8BC3B6", xlab = "Base Total")
hist(Medpok$base_total, main = "Medium Pokemon", col = "#8BC3B6", xlab = "Base Total")
hist(Bigpok$base_total, main = "Large Pokemon", col = "#8BC3B6", xlab = "Base Total")
hist(rnorm(5000,mean=500,sd=50), main = "Normal Control", col="#A6B91A", xlab = "Normal function")
##### Hypothesis3: Normality test by qqplot
par(mfrow=c(2,2))
qqnorm(Smallpok$base_total, col = "#8BC3B6", main = "Small pokemon" )
qqline(Smallpok$base_total)
qqnorm(Medpok$base_total, col="#8BC3B6", main = "Medium Pokemon")
qqline(Medpok$base_total)
qqnorm(Bigpok$base_total, col = "#8BC3B6", main = "Big pokemon")
qqline(Bigpok$base_total)
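# Draw the control sample once so that qqnorm() and qqline() use the same data
normal_control <- rnorm(5000,mean=500,sd=50)
qqnorm(normal_control, col="#A6B91A", main = "Normal Function")
qqline(normal_control, col= "red")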
##### Hypothesis3: Normality test by Shapiro test
shapiro.test(Smallpok$base_total)
##
## Shapiro-Wilk normality test
##
## data: Smallpok$base_total
## W = 0.94951, p-value = 4.46e-08
shapiro.test(Medpok$base_total)
##
## Shapiro-Wilk normality test
##
## data: Medpok$base_total
## W = 0.97917, p-value = 0.0008554
shapiro.test(Bigpok$base_total)
##
## Shapiro-Wilk normality test
##
## data: Bigpok$base_total
## W = 0.95806, p-value = 0.0001116
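All three P values are below 0.05, so we reject the null hypothesis of normality: the base total is not normally distributed in any of the size categories.
Since the data are not normally distributed, we use the non-parametric Wilcoxon rank-sum test. The boxplot above suggests that the base total of big Pokemon is higher than that of medium Pokemon, which in turn is higher than that of small Pokemon, so we use one-sided tests.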
wilcox.test(Bigpok$base_total, Medpok$base_total, alternative = "greater")
##
## Wilcoxon rank sum test with continuity correction
##
## data: Bigpok$base_total and Medpok$base_total
## W = 32841, p-value < 2.2e-16
## alternative hypothesis: true location shift is greater than 0
The P value is very small, so we reject the null hypothesis: big Pokemon tend to have a higher base total than medium Pokemon.
wilcox.test(Medpok$base_total, Smallpok$base_total, alternative = "greater")
##
## Wilcoxon rank sum test with continuity correction
##
## data: Medpok$base_total and Smallpok$base_total
## W = 58828, p-value < 2.2e-16
## alternative hypothesis: true location shift is greater than 0
Here too, the P value is very small, so we reject the null hypothesis: medium Pokemon tend to have a higher base total than small Pokemon.
By above tests we can say that Big Pokemons are better than Medium and Smaller pokemons.
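When several pairwise comparisons are made on the same data, the P values are usually adjusted for multiple testing. The base R function `pairwise.wilcox.test()` runs all pairwise Wilcoxon tests and applies such a correction. The sketch below uses synthetic base totals (the `toy` data frame and its group means are made up), and note that it performs two-sided tests by default, unlike the one-sided tests above.

```r
set.seed(7)

# Synthetic stand-in for the three size groups
toy <- data.frame(
  size = factor(rep(c("Small", "Medium", "Big"), each = 30),
                levels = c("Small", "Medium", "Big")),
  base_total = c(rnorm(30, mean = 320, sd = 60),
                 rnorm(30, mean = 430, sd = 60),
                 rnorm(30, mean = 520, sd = 60))
)

# All pairwise Wilcoxon rank-sum tests with Holm-adjusted P values
pw <- pairwise.wilcox.test(toy$base_total, toy$size, p.adjust.method = "holm")
pw$p.value
```

The result is a lower-triangular matrix with one adjusted P value per pair of groups.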
We can also check whether individual stats of a Pokemon increase with height.
par(mfrow=c(2,2))
plot(Smallpok$height_m, Smallpok$base_total,col="#8BC3B6", ylim = c(0, max(pokemon$base_total)), xlim = c(0, 5), xlab = "Height", ylab = "Base Total", main = "Height vs Base Total")
points(Medpok$height_m, Medpok$base_total, col="#A6B91A")
points(Bigpok$height_m, Bigpok$base_total, col="#705746")
# pch = 1 makes the legend show the same open circles as the plot
legend(3.5, 200, legend=c("Small pokemons", "Medium pokemon", "Big pokemon"),
col=c("#8BC3B6", "#A6B91A","#705746"), pch=1, cex=0.8)
plot(Smallpok$height_m, Smallpok$attack,col="#8BC3B6", ylim = c(0, max(pokemon$attack)), xlim = c(0, 5), xlab = "Height", ylab = "Attack", main = "Height vs Attack")
points(Medpok$height_m, Medpok$attack, col="#A6B91A")
points(Bigpok$height_m, Bigpok$attack, col="#705746")
legend(3.5, 40, legend=c("Small pokemons", "Medium pokemon", "Big pokemon"),
col=c("#8BC3B6", "#A6B91A","#705746"), pch=1, cex=0.8)
# The y axis is scaled to the variable being plotted (defense, not attack)
plot(Smallpok$height_m, Smallpok$defense ,col="#8BC3B6", ylim = c(0, max(pokemon$defense)), xlim = c(0, 5), xlab = "Height", ylab = "Defence", main = "Height vs Defence")
points(Medpok$height_m, Medpok$defense, col="#A6B91A")
points(Bigpok$height_m, Bigpok$defense, col="#705746")
legend(3.5, 40, legend=c("Small pokemons", "Medium pokemon", "Big pokemon"),
col=c("#8BC3B6", "#A6B91A","#705746"), pch=1, cex=0.8)
# weight_kg contains missing values, so na.rm = TRUE is needed when scaling the y axis
plot(Smallpok$height_m, Smallpok$weight_kg,col="#8BC3B6", ylim = c(0, max(pokemon$weight_kg, na.rm = TRUE)), xlim = c(0, 5), xlab = "Height", ylab = "Weight in Kg", main = "Height vs Weight")
points(Medpok$height_m, Medpok$weight_kg, col="#A6B91A")
points(Bigpok$height_m, Bigpok$weight_kg, col="#705746")
legend("topleft", legend=c("Small pokemons", "Medium pokemon", "Big pokemon"),
col=c("#8BC3B6", "#A6B91A","#705746"), pch=1, cex=0.8)
The plots above suggest that height is positively related to these properties, so we run correlation tests.
cor.test(pokemon$height_m, pokemon$base_total)
##
## Pearson's product-moment correlation
##
## data: pokemon$height_m and pokemon$base_total
## t = 17.677, df = 779, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4830419 0.5833202
## sample estimates:
## cor
## 0.5350631
cor.test(pokemon$height_m, pokemon$attack)
##
## Pearson's product-moment correlation
##
## data: pokemon$height_m and pokemon$attack
## t = 13.035, df = 779, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3638075 0.4790907
## sample estimates:
## cor
## 0.4231602
cor.test(pokemon$height_m, pokemon$defense)
##
## Pearson's product-moment correlation
##
## data: pokemon$height_m and pokemon$defense
## t = 10.837, df = 779, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2993865 0.4213907
## sample estimates:
## cor
## 0.3619375
cor.test(pokemon$height_m, pokemon$weight_kg)
##
## Pearson's product-moment correlation
##
## data: pokemon$height_m and pokemon$weight_kg
## t = 22.438, df = 779, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5819789 0.6673701
## sample estimates:
## cor
## 0.6265511
All four correlation coefficients are positive and significant (P < 2.2e-16), ranging from 0.36 for defense to 0.63 for weight, so base total, attack, defence and weight all tend to increase with height.
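Pearson's correlation measures linear association and is sensitive to skewed data and outliers. Because the normality checks earlier in this report rejected normality for base total, a rank-based alternative such as Spearman's correlation may be a useful cross-check. The sketch below uses synthetic data (the `height` and `base_total` vectors are made up), so only the call pattern carries over to the real dataset.

```r
set.seed(3)

# Synthetic skewed heights and a noisy, monotone relationship with base total
height     <- rexp(100, rate = 1)
base_total <- 300 + 80 * log1p(height) + rnorm(100, sd = 30)

# Spearman's rank correlation: tests for a monotone, not necessarily linear, relationship
sp <- cor.test(height, base_total, method = "spearman")
sp$estimate
sp$p.value
```

On the real data the call would be `cor.test(pokemon$height_m, pokemon$base_total, method = "spearman")`; with tied values R computes an approximate P value and prints a warning.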
As size and weight increase, does it get tougher to capture a Pokemon?
boxplot(as.numeric(Smallpok$capture_rate), as.numeric(Medpok$capture_rate), as.numeric(Bigpok$capture_rate),col = "#8BC3B6", main="Size vs Capture Rate", names = c("Small Pokemon", "Medium Pokemon", "Big Pokemon"), xlab="Pokemon Size", ylab="Capture Rate")
The boxplot above suggests that smaller Pokemon are easier to capture (a higher capture rate means an easier capture). To test this hypothesis, we first check the capture rate of each group for normality.
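One caveat about `as.numeric(...$capture_rate)`: if `capture_rate` was read in as a factor (the default for text columns in `read.csv()` before R 4.0), `as.numeric()` returns the internal level codes rather than the capture rates themselves. The sketch below demonstrates the pitfall on a made-up factor; whether it affects this analysis depends on how the column was actually read, which can be checked with `class(pokemon$capture_rate)`.

```r
# Made-up capture rates stored as a factor
rate <- factor(c("45", "255", "3", "45"))

as.numeric(rate)                 # level codes (positions in the sorted levels), not the rates
as.numeric(as.character(rate))   # the actual numeric values
```

Converting through `as.character()` first is the safe route; any entry that is not a plain number then becomes `NA` with a warning instead of silently turning into a level code.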
par(mfrow=c(2,2))
hist(as.numeric(Smallpok$capture_rate), main = "Small Pokemon", col = "#8BC3B6", xlab = "Capture Rate")
hist(as.numeric(Medpok$capture_rate), main = "Medium Pokemon", col = "#8BC3B6", xlab = "Capture Rate")
hist(as.numeric(Bigpok$capture_rate), main = "Large Pokemon", col = "#8BC3B6", xlab = "Capture Rate")
hist(rnorm(5000,mean=500,sd=50), main = "Normal Control", col="#A6B91A", xlab = "Normal function")
##### Hypothesis4: Normality test by qqplot
par(mfrow=c(2,2))
qqnorm(as.numeric(Smallpok$capture_rate), col = "#8BC3B6", main = "Small pokemon" )
qqline(as.numeric(Smallpok$capture_rate))
qqnorm(as.numeric(Medpok$capture_rate), col="#8BC3B6", main = "Medium Pokemon")
qqline(as.numeric(Medpok$capture_rate))
qqnorm(as.numeric(Bigpok$capture_rate), col = "#8BC3B6", main = "Big pokemon")
qqline(as.numeric(Bigpok$capture_rate))
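# Draw the control sample once so that qqnorm() and qqline() use the same data
normal_control <- rnorm(5000,mean=500,sd=50)
qqnorm(normal_control, col="#A6B91A", main = "Normal Function")
qqline(normal_control, col= "red")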
##### Hypothesis4: Normality test by Shapiro test
shapiro.test(as.numeric(Smallpok$capture_rate))
##
## Shapiro-Wilk normality test
##
## data: as.numeric(Smallpok$capture_rate)
## W = 0.88356, p-value = 1.445e-13
shapiro.test(as.numeric(Medpok$capture_rate))
##
## Shapiro-Wilk normality test
##
## data: as.numeric(Medpok$capture_rate)
## W = 0.77337, p-value < 2.2e-16
shapiro.test(as.numeric(Bigpok$capture_rate))
##
## Shapiro-Wilk normality test
##
## data: as.numeric(Bigpok$capture_rate)
## W = 0.69063, p-value < 2.2e-16
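All three P values are far below 0.05, so we reject the null hypothesis of normality: the capture rate is not normally distributed in any of the size categories.
Since the data are not normally distributed, we use the non-parametric Wilcoxon rank-sum test. The boxplot above suggests that the capture rate of smaller Pokemon is higher than that of medium Pokemon, which in turn is higher than that of bigger Pokemon, so we use one-sided tests.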
wilcox.test(as.numeric(Smallpok$capture_rate), as.numeric(Medpok$capture_rate), alternative = "greater")
##
## Wilcoxon rank sum test with continuity correction
##
## data: as.numeric(Smallpok$capture_rate) and as.numeric(Medpok$capture_rate)
## W = 51307, p-value < 2.2e-16
## alternative hypothesis: true location shift is greater than 0
The P value is very small, so we reject the null hypothesis: small Pokemon tend to have a higher capture rate than medium Pokemon.
wilcox.test(as.numeric(Medpok$capture_rate), as.numeric(Bigpok$capture_rate), alternative = "greater")
##
## Wilcoxon rank sum test with continuity correction
##
## data: as.numeric(Medpok$capture_rate) and as.numeric(Bigpok$capture_rate)
## W = 26372, p-value = 7.849e-09
## alternative hypothesis: true location shift is greater than 0
Here too, the P value is very small, so we reject the null hypothesis: medium Pokemon tend to have a higher capture rate than big Pokemon.
From the above tests we can say that smaller Pokemon are easier to capture than medium and bigger Pokemon.
Extracting fighter pokemon from pokemon dataset
fighterpok<- pokemon[pokemon$type1=="fighting" | pokemon$type2=="fighting",]
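# A Pokemon is non-fighting only if neither type is "fighting", so the two conditions are joined with &
# (joining them with | keeps almost every row). The results printed below were produced with the original | filter.
nonfightpok<- pokemon[pokemon$type1!="fighting" & pokemon$type2!="fighting",]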
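Negating "either type is fighting" gives "neither type is fighting", which requires `&` between the two inequality conditions. A simple way to avoid mistakes is to build the logical vector once and negate it. The sketch below illustrates this on a made-up four-row data frame (`toy` and its contents are hypothetical).

```r
# Made-up example with the same type1/type2 layout as the dataset
toy <- data.frame(
  name  = c("A", "B", "C", "D"),
  type1 = c("fighting", "water", "fire", "grass"),
  type2 = c("", "fighting", "flying", ""),
  stringsAsFactors = FALSE
)

# TRUE when either type slot is "fighting"
is_fighting <- toy$type1 == "fighting" | toy$type2 == "fighting"

fighters     <- toy[is_fighting, ]    # rows A and B
non_fighters <- toy[!is_fighting, ]   # rows C and D

# For comparison: joining the inequalities with | keeps every row here,
# because no row has "fighting" in both slots
or_filter <- toy[toy$type1 != "fighting" | toy$type2 != "fighting", ]
nrow(or_filter)
```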
par(mfrow=c(1,2))
boxplot(fighterpok$base_total, nonfightpok$base_total, main="Base Total", names = c("Fighting","Non Fighting"), col = c("#8BC3B6","#8BC3B6"), xlab = "Fighting Type", ylab = "Base Total")
boxplot(fighterpok$attack, nonfightpok$attack, main="Attack", names = c("Fighting","Non Fighting"), col = c("#8BC3B6","#8BC3B6"), xlab = "Fighting Type", ylab = "Attack")
The boxplots above suggest that the base total and attack of Fighting Pokemon are higher than those of non-Fighting ones.
Before choosing a test, we first check whether the data are normally distributed, using the following methods.
par(mfrow=c(2,3))
hist(fighterpok$base_total, main = "Fighter Pokemons base total", xlab = "Base Total of Fighting pokemons", col= "#8BC3B6")
hist(nonfightpok$base_total, main = "Non Fighter Pokemons base total", xlab = "Base Total of Non Fighting pokemons", col= "#8BC3B6")
hist(fighterpok$attack, main = "Fighting Pokemons attack", xlab = "Attack of Fighting pokemons", col= "#8BC3B6")
hist(nonfightpok$attack, main = "Non Fighting Pokemons attack", xlab = "Attack of Non Fighting pokemons", col= "#8BC3B6")
hist(rnorm(5000,mean=500,sd=50), main = "Normal control", xlab = "Normal function", col="#A6B91A")
Judging by the histograms above, the base total and attack of Fighting and non-Fighting Pokemon do not appear to be normally distributed.
par(mfrow=c(2,3))
qqnorm(fighterpok$base_total, col = "#8BC3B6", main = "Fighting Pokemon Base total")
qqline(fighterpok$base_total)
qqnorm(nonfightpok$base_total, col="#8BC3B6", main = "Non Fighting Pokemon Base total")
qqline(nonfightpok$base_total)
qqnorm(fighterpok$attack, col="#8BC3B6", main = "Fighting Pokemon Attack")
qqline(fighterpok$attack)
qqnorm(nonfightpok$attack, col="#8BC3B6", main = "Non Fighting Pokemon Attack")
qqline(nonfightpok$attack)
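# Draw the control sample once so that qqnorm() and qqline() use the same data
normal_control <- rnorm(5000,mean=500,sd=50)
qqnorm(normal_control, col="#A6B91A", main = "Normal Control")
qqline(normal_control)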
Judging by the Q-Q plots above, the base total and attack of Fighting and non-Fighting Pokemon do not appear to be normally distributed.
shapiro.test(fighterpok$base_total)
##
## Shapiro-Wilk normality test
##
## data: fighterpok$base_total
## W = 0.96005, p-value = 0.07379
The P value (0.07379) is above 0.05, so we cannot reject the null hypothesis of normality: the base total of Fighting Pokemon is consistent with a normal distribution.
shapiro.test(nonfightpok$base_total)
##
## Shapiro-Wilk normality test
##
## data: nonfightpok$base_total
## W = 0.98024, p-value = 6.195e-09
shapiro.test(nonfightpok$attack)
##
## Shapiro-Wilk normality test
##
## data: nonfightpok$attack
## W = 0.97948, p-value = 3.581e-09
Both P values are far below 0.05, so we reject the null hypothesis of normality: the base total and attack of non-Fighting Pokemon are not normally distributed.
Since the non-Fighting group is not normally distributed, we use the non-parametric Wilcoxon rank-sum test. The boxplots above suggest that the base total and attack of Fighting Pokemon are higher, so we use one-sided tests.
##### Wilcoxon test
* Null Hypothesis: The base total and attack of Fighting Pokemon are not greater than those of non-Fighting Pokemon.
* Alternate Hypothesis: The base total and attack of Fighting Pokemon are greater than those of non-Fighting Pokemon.
wilcox.test(fighterpok$base_total, nonfightpok$base_total, alternative = "greater")
##
## Wilcoxon rank sum test with continuity correction
##
## data: fighterpok$base_total and nonfightpok$base_total
## W = 24521, p-value = 0.02912
## alternative hypothesis: true location shift is greater than 0
wilcox.test(fighterpok$attack, nonfightpok$attack, alternative = "greater")
##
## Wilcoxon rank sum test with continuity correction
##
## data: fighterpok$attack and nonfightpok$attack
## W = 30916, p-value = 1.251e-08
## alternative hypothesis: true location shift is greater than 0
Both P values are below 0.05 (0.02912 for base total and 1.251e-08 for attack), so we reject the null hypothesis in both cases.
From the above tests we can say that Fighting Pokemon have a higher attack and a higher base total than non-Fighting Pokemon. Defence was not tested here.