R Markdown Pokemon

This document was produced after watching the Edureka video on YouTube, "Machine Learning with R | Machine Learning Algorithms | Data Science Training | Edureka": https://www.youtube.com/watch?v=SeyghJ5cdm4&t=685s

My sincere thanks to Edureka.

For more information, please write to Edureka or call them at IND: 9606058406 / US: 18338555775 (toll free).

Instagram: https://www.instagram.com/edureka_lea… Facebook: https://www.facebook.com/edurekaIN/ Twitter: https://twitter.com/edurekain LinkedIn: https://www.linkedin.com/company/edureka

I experimented with the code/models suggested in the video and added my own variants, using different code and tools offered by jtools and other packages.

We undertake three tasks in this exercise.

  1. Data wrangling: obtaining the data from the website, reading it into R, and finding the best Pokemon of the grass, water and fire types.

  2. Data analysis using linear regression: we fit a few models to find out how the attack capability of a Pokemon is influenced by different combinations of other parameters, and conclude which of them is the best model.

  3. Data analysis using a classification procedure: recursive partitioning and regression trees.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caTools)

1. Data Wrangling

Let us read the Pokemon dataset. It was downloaded from https://www.kaggle.com/rounakbanik/pokemon/version/1 ("The Complete Pokemon Dataset": data on more than 800 Pokemon from all 7 generations), accessed on 19th Oct 2019.

pokemon <- read.csv("pokemon.csv")
str(pokemon)
## 'data.frame':    801 obs. of  41 variables:
##  $ abilities        : Factor w/ 482 levels "['Adaptability', 'Download', 'Analytic']",..: 244 244 244 22 22 22 453 453 453 348 ...
##  $ against_bug      : num  1 1 1 0.5 0.5 0.25 1 1 1 1 ...
##  $ against_dark     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_dragon   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_electric : num  0.5 0.5 0.5 1 1 2 2 2 2 1 ...
##  $ against_fairy    : num  0.5 0.5 0.5 0.5 0.5 0.5 1 1 1 1 ...
##  $ against_fight    : num  0.5 0.5 0.5 1 1 0.5 1 1 1 0.5 ...
##  $ against_fire     : num  2 2 2 0.5 0.5 0.5 0.5 0.5 0.5 2 ...
##  $ against_flying   : num  2 2 2 1 1 1 1 1 1 2 ...
##  $ against_ghost    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_grass    : num  0.25 0.25 0.25 0.5 0.5 0.25 2 2 2 0.5 ...
##  $ against_ground   : num  1 1 1 2 2 0 1 1 1 0.5 ...
##  $ against_ice      : num  2 2 2 0.5 0.5 1 0.5 0.5 0.5 1 ...
##  $ against_normal   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_poison   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_psychic  : num  2 2 2 1 1 1 1 1 1 1 ...
##  $ against_rock     : num  1 1 1 2 2 4 1 1 1 2 ...
##  $ against_steel    : num  1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 1 ...
##  $ against_water    : num  0.5 0.5 0.5 2 2 2 0.5 0.5 0.5 1 ...
##  $ attack           : int  49 62 100 52 64 104 48 63 103 30 ...
##  $ base_egg_steps   : int  5120 5120 5120 5120 5120 5120 5120 5120 5120 3840 ...
##  $ base_happiness   : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ base_total       : int  318 405 625 309 405 634 314 405 630 195 ...
##  $ capture_rate     : Factor w/ 34 levels "100","120","125",..: 26 26 26 26 26 26 26 26 26 21 ...
##  $ classfication    : Factor w/ 588 levels "Abundance Pokémon",..: 449 449 449 299 187 187 531 546 457 585 ...
##  $ defense          : int  49 63 123 43 58 78 65 80 120 35 ...
##  $ experience_growth: int  1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1000000 ...
##  $ height_m         : num  0.7 1 2 0.6 1.1 1.7 0.5 1 1.6 0.3 ...
##  $ hp               : int  45 60 80 39 58 78 44 59 79 45 ...
##  $ japanese_name    : Factor w/ 801 levels "Abagouraã‚¢ãƒ\220ゴーラ",..: 200 201 199 288 416 417 794 334 336 80 ...
##  $ name             : Factor w/ 801 levels "Abomasnow","Abra",..: 73 321 745 95 96 93 656 764 56 88 ...
##  $ percentage_male  : num  88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 50 ...
##  $ pokedex_number   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sp_attack        : int  65 80 122 60 80 159 50 65 135 20 ...
##  $ sp_defense       : int  65 80 120 50 65 115 64 80 115 20 ...
##  $ speed            : int  45 60 80 65 80 100 43 58 78 45 ...
##  $ type1            : Factor w/ 18 levels "bug","dark","dragon",..: 10 10 10 7 7 7 18 18 18 1 ...
##  $ type2            : Factor w/ 19 levels "","bug","dark",..: 15 15 15 1 1 9 1 1 1 1 ...
##  $ weight_kg        : num  6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
##  $ generation       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ is_legendary     : int  0 0 0 0 0 0 0 0 0 0 ...
pokemon$is_legendary <- as.factor(pokemon$is_legendary)

table(pokemon$type1) # Primary types
## 
##      bug     dark   dragon electric    fairy fighting     fire   flying 
##       72       29       27       39       18       28       52        3 
##    ghost    grass   ground      ice   normal   poison  psychic     rock 
##       27       78       32       23      105       32       53       45 
##    steel    water 
##       24      114
table(pokemon$type2) # Secondary types
## 
##               bug     dark   dragon electric    fairy fighting     fire 
##      384        5       21       17        9       29       25       13 
##   flying    ghost    grass   ground      ice   normal   poison  psychic 
##       95       14       20       34       15        4       34       29 
##     rock    steel    water 
##       14       22       17

Finding the best Grass Pokemon

grassPokemon <- pokemon %>% filter(type1 == "grass")
grassPoisonPokemon <- grassPokemon %>% filter(type2 == "poison")
mygrassPokemon <- grassPoisonPokemon %>% filter(speed == max(grassPoisonPokemon$speed))
mygp1 <- pokemon %>% filter(type1 == "grass" & type2 == "poison" )
mygp <- mygp1 %>% filter(speed == max(speed))

mygp and mygrassPokemon are the same result, obtained with different code.
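A quick way to verify this (not part of the original code) is to compare the two data frames directly with base R:

identical(mygrassPokemon, mygp)   # TRUE when both pipelines return exactly the same data frame
all.equal(mygrassPokemon, mygp)   # a more tolerant check that also reports any differences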

Finding the best Water Pokemon

waterPokemon <- pokemon %>% filter(type1 == "water")
waterPsychicPokemon <- waterPokemon %>% filter(type2 == "psychic")
mywaterPokemon <- waterPsychicPokemon %>% filter(defense == max(waterPsychicPokemon$defense))

Finding the best Fire Pokemon

firePokemon <- pokemon %>% filter(type1 == "fire")
fireFightingPokemon <- firePokemon %>% filter(type2 == "fighting")
myfirePokemon <- fireFightingPokemon %>% filter(attack == max(fireFightingPokemon$attack))

Listing all the Pokemon ready for the Johto League with their capabilities

myPokemon <- rbind(mygrassPokemon,mywaterPokemon,myfirePokemon)
myPokemon[, c(31,37,38, 20,26,34,35,36)]
##       name type1    type2 attack defense sp_attack sp_defense speed
## 1 Roserade grass   poison     70      65       125        105    90
## 2  Slowbro water  psychic     75     180       130         80    30
## 3 Blaziken  fire fighting    160      80       130         80   100

2. Linear Regression

To find the factors influencing the attack capability of a Pokemon we use linear regression. First we split the Pokemon dataset into a training set and a testing set, using the sample.split function of the caTools package.

This function splits the data in a given vector into two sets in a predefined ratio while preserving the relative ratios of the different labels in the vector.

split_index <- sample.split(pokemon$attack,SplitRatio = 0.65)
train1 <- subset(pokemon, split_index == TRUE)
test1 <- subset(pokemon, split_index == FALSE)
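Note that sample.split draws at random, so the exact rows in train1 and test1 (and hence the RMSE values reported further down) will vary from run to run. A minimal sketch, not in the original code, of how a fixed seed makes the split reproducible:

set.seed(2019)   # any fixed integer works; assumed value, not from the video
split_index <- sample.split(pokemon$attack, SplitRatio = 0.65)
train1 <- subset(pokemon, split_index == TRUE)
test1 <- subset(pokemon, split_index == FALSE)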

We build the model using the lm function, with attack as the dependent variable and defense as the independent variable, and fit it on the training set to obtain the model object lm1. Next, we apply lm1 to the test set test1 using the predict function to get a vector of predicted values, result1. We then combine the actual and predicted values column-wise with cbind into final_data, which is converted to a data frame. We compute the error by subtracting the predicted vector from the actual vector, and bind it onto final_data as a third column. Finally we compute the Root Mean Squared Error (RMSE), which measures the model's prediction error: it corresponds to the average difference between the observed values of the outcome and the values predicted by the model, and is computed as RMSE = sqrt(mean((observed - predicted)^2)). The lower the RMSE, the better the model.

Linear regression Model 1

lm1 <- lm(attack ~ defense,train1 )
result1 <- predict(lm1,test1)
final_data <- cbind(Actual = test1$attack, Predicted = result1)
final_data <- as.data.frame(final_data)

error <- final_data$Actual - final_data$Predicted
final_data <- cbind(final_data,error)
rmse1 <- sqrt(mean(final_data$error^2))
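The predict / error / RMSE steps are repeated for every model below, so they could be wrapped in a small helper function. A hedged sketch (the function name rmse_on_test is my own, not from the video):

# compute RMSE of a fitted model on a test set
rmse_on_test <- function(model, test_data, actual) {
  predicted <- predict(model, test_data)
  sqrt(mean((actual - predicted)^2))
}
# e.g. rmse_on_test(lm1, test1, test1$attack) should reproduce rmse1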

Linear regression Model 2

We now build another model and follow the same procedure. This time we keep "attack" as the dependent variable and "defense", "hp" and "speed" as the independent variables.

lm2 <- lm(attack ~ defense+speed+hp,train1 )
result2 <- predict(lm2,test1)
final_data2 <- cbind(Actual = test1$attack, Predicted = result2)
final_data2 <- as.data.frame(final_data2)

error2 <- final_data2$Actual - final_data2$Predicted
final_data2 <- cbind(final_data2,error2)
rmse2 <- sqrt(mean(final_data2$error2^2))

Linear regression Model 3

We now build yet another model and follow the same procedure. To use column names 2 to 19 in the formula, instead of typing them all out (which is laborious), an interesting technique has been deployed; a base-R alternative using reformulate() is sketched after the code.

# paste colnames 2 to 19 of pokemon into a formula string, with column 20 (attack) as the response, then convert it with as.formula
dv <- as.formula(paste(colnames(pokemon)[20], paste(colnames(pokemon)[2:19], collapse=" + "), sep=" ~ "))
lm3 <- lm(dv,train1 )
result3 <- predict(lm3,test1)
final_data3 <- cbind(Actual = test1$attack, Predicted = result3)
final_data3 <- as.data.frame(final_data3)

error3 <- final_data3$Actual - final_data3$Predicted
final_data3 <- cbind(final_data3,error3)
rmse3 <- sqrt(mean(final_data3$error3^2))
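As an aside, base R's reformulate() builds the same formula without pasting strings by hand; a sketch of the equivalent call (dv_alt is a hypothetical name, not used elsewhere):

# build "attack ~ against_bug + ... + against_water" directly from the column names
dv_alt <- reformulate(colnames(pokemon)[2:19], response = colnames(pokemon)[20])
dv_alt   # should print the same formula as dv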

Linear regression Model 4

# same technique, now adding columns 26 (defense), 29 (hp) and 36 (speed) to columns 2 to 19
dv <- as.formula(paste(colnames(pokemon)[20], paste(colnames(pokemon)[c(2:19,26,29,36)], collapse=" + "), sep=" ~ "))
lm4<- lm(dv,train1 )
result4 <- predict(lm4,test1)
final_data4 <- cbind(Actual = test1$attack, Predicted = result4)
final_data4 <- as.data.frame(final_data4)

error4 <- final_data4$Actual - final_data4$Predicted
final_data4 <- cbind(final_data4,error4)
rmse4 <- sqrt(mean(final_data4$error4^2))

Out of all these linear regression models, the least error occurs in Model 4, as seen from their RMSE values.

Final Assessment

paste(" Model: ", " attack ~ defense", "gives an error " ,rmse1)
## [1] " Model:   attack ~ defense gives an error  29.2687739339837"
paste(" Model: ", " attack ~ defense+speed+hp", "gives an error " ,rmse2)
## [1] " Model:   attack ~ defense+speed+hp gives an error  24.5880628540036"
paste(" Model: ", " attack ~ against various odds ", "gives an error " ,rmse3)
## [1] " Model:   attack ~ against various odds  gives an error  31.0717663945909"
paste(" Model: ", " attack ~ against various odds and defense, speed and hp ", "gives an error " ,rmse4)
## [1] " Model:   attack ~ against various odds and defense, speed and hp  gives an error  23.902552658325"
paste (" Therefore the least error model is model 4")
## [1] " Therefore the least error model is model 4"
library(jtools)
library(ggstance)
library(kableExtra) # for better output format in R Markdown
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows

Model Summary

Inspired by "Tools for summarizing and visualizing regression models", Jacob Long, 2019-04-08: https://cran.r-project.org/web/packages/jtools/vignettes/summ.html

Now let us summarise and visualise the model using the jtools and ggstance packages.

summ(lm4)
Observations: 527
Dependent variable: attack
Type: OLS linear regression
F(21,505) = 26.18, R² = 0.52, Adj. R² = 0.50

| Term             | Est.   | S.E.  | t val. | p    |
|------------------|--------|-------|--------|------|
| (Intercept)      | 44.48  | 17.79 | 2.50   | 0.01 |
| against_bug      | 0.25   | 3.23  | 0.08   | 0.94 |
| against_dark     | -24.15 | 4.78  | -5.05  | 0.00 |
| against_dragon   | 22.63  | 6.47  | 3.50   | 0.00 |
| against_electric | 1.84   | 3.19  | 0.58   | 0.56 |
| against_fairy    | -3.68  | 3.87  | -0.95  | 0.34 |
| against_fight    | -0.21  | 2.99  | -0.07  | 0.94 |
| against_fire     | 2.79   | 3.31  | 0.84   | 0.40 |
| against_flying   | 5.94   | 3.53  | 1.68   | 0.09 |
| against_ghost    | 3.29   | 3.60  | 0.91   | 0.36 |
| against_grass    | -2.59  | 1.92  | -1.35  | 0.18 |
| against_ground   | -7.10  | 2.92  | -2.43  | 0.02 |
| against_ice      | -4.52  | 2.24  | -2.02  | 0.04 |
| against_normal   | -22.53 | 6.22  | -3.62  | 0.00 |
| against_poison   | -5.04  | 3.59  | -1.40  | 0.16 |
| against_psychic  | -3.52  | 4.00  | -0.88  | 0.38 |
| against_rock     | -9.88  | 2.95  | -3.35  | 0.00 |
| against_steel    | 2.53   | 3.09  | 0.82   | 0.41 |
| against_water    | 7.22   | 3.06  | 2.36   | 0.02 |
| defense          | 0.37   | 0.04  | 9.21   | 0.00 |
| hp               | 0.32   | 0.04  | 7.53   | 0.00 |
| speed            | 0.35   | 0.04  | 9.77   | 0.00 |

Standard errors: OLS
summ(lm4, scale = TRUE)
Observations: 527
Dependent variable: attack
Type: OLS linear regression
F(21,505) = 26.18, R² = 0.52, Adj. R² = 0.50

| Term             | Est.   | S.E. | t val. | p    |
|------------------|--------|------|--------|------|
| (Intercept)      | 77.89  | 1.00 | 77.84  | 0.00 |
| against_bug      | 0.15   | 1.96 | 0.08   | 0.94 |
| against_dark     | -10.32 | 2.04 | -5.05  | 0.00 |
| against_dragon   | 8.28   | 2.37 | 3.50   | 0.00 |
| against_electric | 1.17   | 2.02 | 0.58   | 0.56 |
| against_fairy    | -1.89  | 1.99 | -0.95  | 0.34 |
| against_fight    | -0.15  | 2.10 | -0.07  | 0.94 |
| against_fire     | 1.97   | 2.34 | 0.84   | 0.40 |
| against_flying   | 3.70   | 2.20 | 1.68   | 0.09 |
| against_ghost    | 1.84   | 2.01 | 0.91   | 0.36 |
| against_grass    | -2.04  | 1.51 | -1.35  | 0.18 |
| against_ground   | -5.13  | 2.11 | -2.43  | 0.02 |
| against_ice      | -3.39  | 1.68 | -2.02  | 0.04 |
| against_normal   | -5.76  | 1.59 | -3.62  | 0.00 |
| against_poison   | -2.82  | 2.01 | -1.40  | 0.16 |
| against_psychic  | -1.76  | 2.00 | -0.88  | 0.38 |
| against_rock     | -6.61  | 1.98 | -3.35  | 0.00 |
| against_steel    | 1.24   | 1.51 | 0.82   | 0.41 |
| against_water    | 4.49   | 1.90 | 2.36   | 0.02 |
| defense          | 10.96  | 1.19 | 9.21   | 0.00 |
| hp               | 8.19   | 1.09 | 7.53   | 0.00 |
| speed            | 10.45  | 1.07 | 9.77   | 0.00 |

Standard errors: OLS; Continuous predictors are mean-centered and scaled by 1 s.d.
summ(lm4, confint = TRUE, digits = 3)
Observations: 527
Dependent variable: attack
Type: OLS linear regression
F(21,505) = 26.179, R² = 0.521, Adj. R² = 0.501

| Term             | Est.    | 2.5%    | 97.5%   | t val. | p     |
|------------------|---------|---------|---------|--------|-------|
| (Intercept)      | 44.482  | 9.529   | 79.435  | 2.500  | 0.013 |
| against_bug      | 0.246   | -6.106  | 6.597   | 0.076  | 0.939 |
| against_dark     | -24.146 | -33.543 | -14.748 | -5.048 | 0.000 |
| against_dragon   | 22.625  | 9.919   | 35.332  | 3.498  | 0.001 |
| against_electric | 1.842   | -4.416  | 8.101   | 0.578  | 0.563 |
| against_fairy    | -3.684  | -11.281 | 3.913   | -0.953 | 0.341 |
| against_fight    | -0.210  | -6.092  | 5.673   | -0.070 | 0.944 |
| against_fire     | 2.789   | -3.714  | 9.291   | 0.843  | 0.400 |
| against_flying   | 5.945   | -0.988  | 12.878  | 1.685  | 0.093 |
| against_ghost    | 3.288   | -3.779  | 10.356  | 0.914  | 0.361 |
| against_grass    | -2.589  | -6.361  | 1.184   | -1.348 | 0.178 |
| against_ground   | -7.101  | -12.840 | -1.362  | -2.431 | 0.015 |
| against_ice      | -4.522  | -8.914  | -0.130  | -2.023 | 0.044 |
| against_normal   | -22.534 | -34.758 | -10.310 | -3.622 | 0.000 |
| against_poison   | -5.036  | -12.089 | 2.017   | -1.403 | 0.161 |
| against_psychic  | -3.518  | -11.384 | 4.348   | -0.879 | 0.380 |
| against_rock     | -9.877  | -15.678 | -4.077  | -3.346 | 0.001 |
| against_steel    | 2.529   | -3.532  | 8.590   | 0.820  | 0.413 |
| against_water    | 7.217   | 1.213   | 13.222  | 2.362  | 0.019 |
| defense          | 0.366   | 0.288   | 0.444   | 9.206  | 0.000 |
| hp               | 0.316   | 0.233   | 0.398   | 7.531  | 0.000 |
| speed            | 0.354   | 0.283   | 0.426   | 9.767  | 0.000 |

Standard errors: OLS

Visualisation

#effect_plot(lm1,pred = attack, interval = TRUE, plot.points = TRUE)
# plot_coefs(lm4)  needs further investigation
plot_summs(lm4)

plot_summs(lm4,lm2, scale = TRUE)

plot_summs(lm4,lm2, scale = TRUE,plot.distributions = TRUE)
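jtools can also lay two models side by side in a single regression table. A hedged sketch, not in the original; export_summs() additionally requires the huxtable package to be installed:

# side-by-side coefficient table for Model 2 and Model 4
export_summs(lm2, lm4, scale = TRUE, model.names = c("Model 2", "Model 4"))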

3. Classification

Now we take up a classification exercise to find whether a given Pokemon is legendary. As in the previous case, we make training and testing sets, this time splitting on the is_legendary column.

library(rpart)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:ggstance':
## 
##     geom_errorbarh, GeomErrorbarh

Partitioning the data

split_index <- sample.split(pokemon$is_legendary, SplitRatio = 0.65)
train_legend <- subset(pokemon, split_index == TRUE)
test_legend <- subset(pokemon, split_index == FALSE)

Recursive partitioning and regression trees

model1 <- rpart(is_legendary ~ ., data = train_legend)

result1 <- predict(model1, test_legend, type = "class")

table(test_legend$is_legendary, result1)
##    result1
##       0   1
##   0 255   1
##   1   6  18
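Before building the confusion matrix it is worth checking which columns the tree actually split on; rpart stores a variable importance vector in the fitted object. A sketch, not in the original (the exact values depend on the random split):

# most influential variables in model1, highest first
head(sort(model1$variable.importance, decreasing = TRUE))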

Construct the confusion matrix and find the accuracy of the model using the caret package.

Direct command: confusionMatrix

confusionMatrix(table(test_legend$is_legendary, result1)) 
## Confusion Matrix and Statistics
## 
##    result1
##       0   1
##   0 255   1
##   1   6  18
##                                           
##                Accuracy : 0.975           
##                  95% CI : (0.9492, 0.9899)
##     No Information Rate : 0.9321          
##     P-Value [Acc > NIR] : 0.001141        
##                                           
##                   Kappa : 0.8239          
##                                           
##  Mcnemar's Test P-Value : 0.130570        
##                                           
##             Sensitivity : 0.9770          
##             Specificity : 0.9474          
##          Pos Pred Value : 0.9961          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.9321          
##          Detection Rate : 0.9107          
##    Detection Prevalence : 0.9143          
##       Balanced Accuracy : 0.9622          
##                                           
##        'Positive' Class : 0               
## 
CM1 <- confusionMatrix(table(test_legend$is_legendary, result1)) 

Building a second model

Let us build another classification model using attack, defense and speed as independent variables and is_legendary as the dependent variable.

model2 <- rpart(is_legendary ~ attack + defense + speed, data = train_legend)
plot(model2)
text(model2, use.n = TRUE)
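The base plot()/text() rendering is rather bare. If the rpart.plot package is installed (an extra dependency, not used in the video), it draws a more readable tree:

# install.packages("rpart.plot")   # if not already installed
rpart.plot::rpart.plot(model2)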

result2 <- predict(model2, test_legend, type= "class")
table(test_legend$is_legendary, result2) 
##    result2
##       0   1
##   0 249   7
##   1  17   7
confusionMatrix(table(test_legend$is_legendary, result2)) # this also gives accuracy
## Confusion Matrix and Statistics
## 
##    result2
##       0   1
##   0 249   7
##   1  17   7
##                                           
##                Accuracy : 0.9143          
##                  95% CI : (0.8751, 0.9443)
##     No Information Rate : 0.95            
##     P-Value [Acc > NIR] : 0.99600         
##                                           
##                   Kappa : 0.3258          
##                                           
##  Mcnemar's Test P-Value : 0.06619         
##                                           
##             Sensitivity : 0.9361          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.9727          
##          Neg Pred Value : 0.2917          
##              Prevalence : 0.9500          
##          Detection Rate : 0.8893          
##    Detection Prevalence : 0.9143          
##       Balanced Accuracy : 0.7180          
##                                           
##        'Positive' Class : 0               
## 
CM2 <- confusionMatrix(table(test_legend$is_legendary, result2))
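As a quick cross-check (not in the original), the accuracy reported by confusionMatrix is simply the proportion of correct predictions and can be recovered directly from the table:

tab2 <- table(test_legend$is_legendary, result2)   # tab2 is a hypothetical name
sum(diag(tab2)) / sum(tab2)   # should equal CM2$overall["Accuracy"]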

Final Assessment of the two models

paste(" Model 1: ", "is_legendary ~ .(i.e all columns) ", "gives an accuracy of ", CM1[[3]][1] )
## [1] " Model 1:  is_legendary ~ .(i.e all columns)  gives an accuracy of  0.975"
paste(" Model 2: ", " is_legendary ~ attack+ defense +speed", "gives an accuracy of " , CM2[[3]][1])
## [1] " Model 2:   is_legendary ~ attack+ defense +speed gives an accuracy of  0.914285714285714"
paste (" Therefore the first model gives better accuracy  ")
## [1] " Therefore the first model gives better accuracy  "

Thanks for your time

Note: This document was made in the process of learning to code in R and exploring the concepts of machine learning, and is purely intended for educational purposes. Comments, suggestions for further reading, and corrections are welcome.