Page 1: Introduction

Column

Quantified Self (QS), also known as “Lifelogging”, applies data science to an individual’s quality of life. QS first came to public attention in 2007 as a movement led by Gary Wolf and Kevin Kelly of Wired magazine. In the beginning, it referred to tracking personal health metrics such as heart rate, blood pressure, body weight, and sleep cycle. The movement has grown since then and now extends to the emerging market in wearable technology, e.g., Apple Watch, Samsung Gear, Fitbit, Mi Band, and Jawbone.

Pokémon GO is an augmented reality (AR), location-based mobile game for iOS and Android, published by Niantic Inc. and released on July 6th, 2016. It is so popular that it reached 800 million downloads in May 2018. I have been playing the game since launch day and have spent both time and money (in-app purchases, IAP) on it. The objective here is to analyze and visualize my gameplay data from the perspective of QS.

The dataset was collected from March 22nd, 2017 to February 16th, 2019. Table 1 shows the cleaned dataset with six attributes and 1668 instances. “CP” means “combat power” while “IV” means “individual values”. Legendary and Mythical Pokémon are rare; they are obtainable from Level-5 Gym raids, Field and Special Research, or Trading. “CP”, “IV”, “Fast Move”, and “Charged Move” are randomly assigned to each Pokémon. More details are given in each analysis.

Column

Table 1: Pokémon GO Legendaries

Pokémon GO Logo

Page 2: Visualization

QUESTION ONE: Is Legendary Pokémon IV normally distributed?

Columns “CP”, “Fast Move”, and “Charged Move” are not taken into consideration for two reasons. First, these statistics have been rebalanced on the server side (e.g., in October 2018), which modified the statistics of multiple Pokémon, and this dataset did not update the CP values recorded before the rebalance. Second, CP varies across Pokémon species, whereas IV is a more universal way to appraise any Pokémon. Traded Pokémon are all excluded because Trading triggers an IV re-roll.

IV sums three components: the Attack, Defence, and Stamina (HP) ratings. For Legendary Pokémon, each component is at least 10 (a floor of 10+10+10) and at most 15 (a cap of 15+15+15). If Attack, Defence, and Stamina are 15, 10, and 11 respectively, IV is calculated as (15+10+11)/45 = 80%. When several IV values are possible for a Pokémon, the arithmetic mean of the possible IVs is used.

QUESTION ONE: Is Legendary Pokémon IV normally distributed?

As Fig. 1 shows, the red density curve does not follow the normal (Gaussian) distribution; it has two modes. The Shapiro-Wilk normality test gives a p-value of 9.9e-12, which is statistically significant at an alpha level of 0.05; hence, we reject the null hypothesis that the data are normally distributed.

In fact, by its definition, IV is a discrete random variable. Therefore, the normal distribution is not a reasonable assumption in the first place. MORE: http://community.dur.ac.uk/c.c.d.s.caiado/multinomial.pdf

QUESTION TWO: Are accounts independent from each other?

Column “Owner” has two levels, namely “WZX” and “SDQ”. I have been playing the game on two accounts and recording data from both for analysis. Account “SDQ” has 858 observations, and Account “WZX” has 810 observations.

QUESTION TWO: Are accounts independent from each other?

Fig. 2 shows that the two accounts share a similar multimodal distribution. Because the data violate the normality assumption, a two-sample t-test is not appropriate in this case.

	100	67-69	70-72	73-75	76-78	79-81	82-84	85-87	88-90	91-93	94-96	97-99
SDQ	5	16	28	44	140	89	207	106	75	105	28	15
WZX	1	18	34	46	135	91	191	94	74	83	29	14

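It seems that Account “SDQ” has better luck with perfect IV (100%). A chi-squared test is run to test whether IV rank is independent of the account. The p-value of 0.8637 is not statistically significant at an alpha level of 0.05; hence, we fail to reject the null hypothesis that IV rank is independent of the account, i.e., there is no evidence that the two accounts differ in their IV distributions.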


    Pearson's Chi-squared test

data:  chi
X-squared = 6.1427, df = 11, p-value = 0.8637

QUESTION THREE: Which Legendary Pokémon did I spend the most effort on?

Column “Name” has 23 levels, namely Rayquaza, Articuno, Mewtwo, Moltres, Groudon, Ho-Oh, Latias, Latios, Lugia, Zapdos, Suicune, Kyogre, Entei, Regice, Raikou, Regirock, Registeel, Mew, Giratina, Deoxys, Cresselia, Heatran, and Palkia. Meltan and Melmetal are not recorded.

QUESTION THREE: Which Legendary Pokémon did I spend the most effort on?

The top six Legendary Pokémon are Rayquaza, Mewtwo, Groudon, Zapdos, Lugia, and Ho-Oh, with total counts of 273, 207, 166, 137, 134, and 104 respectively. Among the 23 Legendary Pokémon, Rayquaza cost me the most effort, and indeed it is powerful in the game. Notably, Mew has been caught only twice (once on each account) because a Mythical Pokémon like Mew can be obtained only once per account. Palkia was released only on January 29th, 2019; hence its sample size is still small. The sample sizes of Regirock, Registeel, Latias, and Giratina are also small because they are not useful in the game. Deoxys is the current EX Raid boss, so it is rare.

QUESTION FOUR: Do Top six Legendary Pokémon have the same distribution?

Fig. 4 is a notched boxplot showing the minimum, first quartile, median, third quartile, and maximum IV for each of the top six Legendary Pokémon. The top six are those with a count over 100, as shown in Fig. 3.

QUESTION FOUR: Do Top six Legendary Pokémon have the same distribution?

In this sample, Mewtwo and Rayquaza reach a perfect IV (100%). Ho-Oh, Lugia, Mewtwo, and Rayquaza have similar medians of around 82%. Groudon and Zapdos have the same median of 84%. Notably, Ho-Oh and Lugia show the same performance; fittingly, they are supposed to be a pair in the game.

QUESTION FIVE: Which Legendary Pokémon should I improve on catching in the future?

Fig. 5 shows the grouped frequency table as a balloon plot: the warmer the color, the higher the mean IV; the larger the circle, the higher the count.

The two accounts caught similar numbers of each Legendary Pokémon, except that Account “WZX” has not joined many recent Palkia raids.

QUESTION FIVE: Which Legendary Pokémon should I improve on catching in the future?

Suggestions for next steps:
1) catch more Palkia on Account “WZX”;
2) catch more Deoxys on Account “WZX” because the sample IV is low;
3) catch more Giratina on Account “SDQ” because of low IV;
4) catch more Latios on both accounts because of low IV;
5) catch more Registeel on both accounts because of low IV.

---
title: "ANLY-512: The Quantified Self"
author: "Zhengxiao Wei"
date: "`r Sys.Date()`"
output:
  flexdashboard::flex_dashboard:
    source_code: embed
    navbar:
      - { title: "About", href: "https://moodle.harrisburgu.edu/pluginfile.php/524023/mod_assign/introattachment/0/Final_Project_Description.html?forcedownload=1", align: right }
---

```{r setup, include=F}
knitr::opts_chunk$set(echo=F)
options(warn=-1)

if(!require(dplyr)) {install.packages("dplyr")}
if(!require(DT)) {install.packages("DT")}
if(!require(flexdashboard)) {install.packages("flexdashboard")}
if(!require(ggplot2)) {install.packages("ggplot2")}
if(!require(ggpubr)) {install.packages("ggpubr")}
if(!require(kableExtra)) {install.packages("kableExtra")}

library(flexdashboard)
library(ggplot2)
library(ggpubr)
library(kableExtra)
```

```{r setup-data, cache=F, include=F}
legendary <- read.csv("~/Documents/HU/ANLY 512-91-O/15-Project Summary & Course Final/512 Final Project/PokemonGOLegendary.csv",stringsAsFactors=F,na.strings=c(""))
legendaryBackup <- legendary #raw data

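#data cleaning: keep rows with a name and a recorded IV, and exclude traded Pokémon
legendary <- subset(legendary,!is.na(Name)&IV!="/"&is.na(Trade))
#columns 1, 4, 5, and 6 become factors; columns 2 and 3 become numeric
legendary[,c(1,4,5,6)] <- lapply(legendary[,c(1,4,5,6)], as.factor)
legendary[,c(2,3)] <- lapply(legendary[,c(2,3)], as.numeric)
#keep IV of at least 67%, the floor implied by the 10+10+10 minimum (30/45)
legendary <- subset(legendary,IV>=67)
legendary <- legendary[,-7] #drop the seventh column, leaving six attributes
rownames(legendary) <- NULL #renumber rows after filtering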
```

Page 1: Introduction
=======================================================================

Column {.sidebar}
-----------------------------------------------------------------------

**Quantified Self** (QS), also known as "Lifelogging", applies data science to an individual's quality of life. QS first came to public attention in 2007 as a movement led by Gary Wolf and Kevin Kelly of Wired magazine. In the beginning, it referred to tracking personal health metrics such as heart rate, blood pressure, body weight, and sleep cycle. The movement has grown since then and now extends to the emerging market in wearable technology, e.g., Apple Watch, Samsung Gear, Fitbit, Mi Band, and Jawbone.

**Pokémon GO** is an augmented reality (AR), location-based mobile game for iOS and Android, published by Niantic Inc. and released on July 6th, 2016. It is so popular that it reached 800 million downloads in May 2018. I have been playing the game since launch day and have spent both time and money (in-app purchases, IAP) on it. The objective here is to analyze and visualize my gameplay data from the perspective of QS.

The dataset was collected from March 22nd, 2017 to February 16th, 2019. Table 1 shows the cleaned dataset with six attributes and 1668 instances. "CP" means "combat power" while "IV" means "individual values". Legendary and Mythical Pokémon are rare; they are obtainable from Level-5 Gym raids, Field and Special Research, or Trading. "CP", "IV", "Fast Move", and "Charged Move" are randomly assigned to each Pokémon. More details are given in each analysis.

Column
-----------------------------------------------------------------------

### Table 1: Pokémon GO Legendaries {data-height=750}
```{r table1}
#cleaned data
DT::datatable(legendary,filter="top")
```

### Pokémon GO Logo 
```{r logo, out.width="100%"}
knitr::include_graphics("/Users/sherloconan/Documents/HU/ANLY 512-91-O/15-Project Summary & Course Final/512 Final Project/PokemonGOlogo.png")
```

Page 2: Visualization {.storyboard}
=======================================================================

### **QUESTION ONE: Is Legendary Pokémon IV normally distributed?** {data-commentary-width=400}

```{r histogram}
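#after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
p1 <- ggplot(legendary,aes(IV))+geom_histogram(aes(y=after_stat(density)),binwidth=2)+
  geom_density(col=2)+ #red kernel density curve drawn over the histogram
  ggtitle("Fig. 1. Histogram of Legendary Pokémon")+labs(x="Individual Values (%)",y="Density")+theme_light()
p1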
```

*** 

Columns "CP", "Fast Move", and "Charged Move" are not taken into consideration for two reasons. First, these statistics have been rebalanced on the server side (e.g., in October 2018), which modified the statistics of multiple Pokémon, and this dataset did not update the CP values recorded before the rebalance. Second, CP varies across Pokémon species, whereas IV is a more universal way to appraise any Pokémon. Traded Pokémon are all excluded because Trading triggers an IV re-roll.

IV sums three components: the Attack, Defence, and Stamina (HP) ratings. For Legendary Pokémon, each component is at least 10 (a floor of 10+10+10) and at most 15 (a cap of 15+15+15). If Attack, Defence, and Stamina are 15, 10, and 11 respectively, IV is calculated as (15+10+11)/45 = 80%. When several IV values are possible for a Pokémon, the arithmetic mean of the possible IVs is used.
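
As a worked example of the IV calculation, the following minimal sketch converts the three components into a percentage. The helper name `iv_percent` is hypothetical and not part of the original analysis; it assumes each component is a whole number between 0 and 15.

```r
# Hypothetical helper (not part of the original analysis):
# converts the three IV components into a percentage of the 45-point maximum.
iv_percent <- function(attack, defence, stamina) {
  parts <- c(attack, defence, stamina)
  stopifnot(all(parts >= 0), all(parts <= 15))
  100 * sum(parts) / 45
}

iv_percent(15, 10, 11)  # 36/45 = 80
iv_percent(10, 10, 10)  # Legendary floor: 30/45, about 66.7
iv_percent(15, 15, 15)  # perfect IV: 100
```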

 QUESTION ONE: Is Legendary Pokémon IV normally distributed? 

```{r normality test, include=F}
shapiro.test(legendary$IV)
```

As Fig. 1 shows, the red density curve does not follow the normal (Gaussian) distribution; it has two modes. The Shapiro-Wilk normality test gives a p-value of 9.9e-12, which is statistically significant at an alpha level of 0.05; hence, we **reject the null hypothesis that the data are normally distributed**.

In fact, by its definition, IV is a discrete random variable. Therefore, the normal distribution is not a reasonable assumption in the first place. [MORE](http://community.dur.ac.uk/c.c.d.s.caiado/multinomial.pdf)
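
To illustrate why IV is discrete, the sketch below enumerates every Attack/Defence/Stamina combination. It rests on an assumption that the dataset itself does not state: each component of a Legendary Pokémon is drawn independently and uniformly from 10 to 15. The object names are illustrative only.

```r
# Assumption: each component is independent and uniform on 10..15.
combos <- expand.grid(attack = 10:15, defence = 10:15, stamina = 10:15)
combos$iv <- 100 * rowSums(combos[, c("attack", "defence", "stamina")]) / 45

# Probability mass function of IV under that assumption
iv_pmf <- prop.table(table(round(combos$iv, 1)))

nrow(combos)     # 6^3 = 216 equally likely combinations
length(iv_pmf)   # distinct IV values, one per component sum from 30 to 45
iv_pmf["100"]    # chance of a perfect IV: 1 of the 216 combinations
```

Under this assumption only a handful of IV values can occur, so a continuous distribution such as the normal can at best approximate the data.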


### **QUESTION TWO: Are accounts independent from each other?** {data-commentary-width=500}

```{r histogram continued}
#ggplot() replaces qplot(), which is deprecated as of ggplot2 3.4.0
p2 <- ggplot(legendary,aes(IV,color=Owner))+geom_density()+
  ggtitle("Fig. 2. Density of Legendary Pokémon")+labs(x="Individual Values (%)",y="Density")+theme_light()
p2
```

*** 

Column "Owner" has two levels, namely "WZX" and "SDQ". I have been playing the game on two accounts and recording data from both for analysis. Account "SDQ" has 858 observations, and Account "WZX" has 810 observations.

 QUESTION TWO: Are accounts independent from each other? 

Fig. 2 shows that the two accounts share a similar multimodal distribution. Because the data violate the normality assumption, a two-sample t-test is not appropriate in this case.

```{r chi-squared test}
#break IV into 3-point ranks: [67,70) -> "67-69", ..., [97,100) -> "97-99", and exactly 100 -> "100"
legendary$Rank <- as.character(cut(legendary$IV,breaks=c(seq(67,100,by=3),Inf),right=FALSE,
  labels=c(paste(seq(67,97,by=3),seq(69,99,by=3),sep="-"),"100")))

#prepare a pivot table
freqtable <- table(legendary$Owner,legendary$Rank)
freqtable %>% kable() %>% kable_styling()
```

It seems that Account "SDQ" has better luck with perfect IV (100%). A chi-squared test is run to test whether IV rank is independent of the account. The p-value of 0.8637 is not statistically significant at an alpha level of 0.05; hence, we **fail to reject the null hypothesis that IV rank is independent of the account**, i.e., there is no evidence that the two accounts differ in their IV distributions.

```{r chi-squared test continued}
chi <- data.frame(freqtable[1,],freqtable[2,])
colnames(chi) <- rownames(freqtable)
chisq.test(chi)
```
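
The test can also be reproduced from the frequency table alone, without the raw data. The following minimal sketch re-enters the reported counts by hand; the object names `ranks`, `counts`, and `result` are illustrative and not part of the original analysis.

```r
# Counts copied from the frequency table (rows: accounts, columns: IV ranks)
ranks <- c("100", "67-69", "70-72", "73-75", "76-78", "79-81",
           "82-84", "85-87", "88-90", "91-93", "94-96", "97-99")
counts <- matrix(
  c(5, 16, 28, 44, 140, 89, 207, 106, 75, 105, 28, 15,   # SDQ
    1, 18, 34, 46, 135, 91, 191,  94, 74,  83, 29, 14),  # WZX
  nrow = 2, byrow = TRUE,
  dimnames = list(Owner = c("SDQ", "WZX"), Rank = ranks)
)

rowSums(counts)  # observations per account

# Both expected counts in the "100" column are below 5, so R warns that the
# chi-squared approximation may be inaccurate; the warning is silenced here.
result <- suppressWarnings(chisq.test(counts))
result$statistic  # X-squared
result$parameter  # degrees of freedom: (2 - 1) * (12 - 1) = 11
result$p.value
```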

### **QUESTION THREE: Which Legendary Pokémon did I spend the most effort on?** {data-commentary-width=400}

```{r barplot}
caught <- as.data.frame(table(legendary$Name))
p3 <- ggplot(caught,aes(x=reorder(Var1,-Freq),y=Freq))+geom_bar(stat="identity",fill="steelblue")+theme(axis.text.x=element_text(angle=45,hjust=1))+labs(x="Pokémon Name",y="Count")+ggtitle("Fig. 3. Caught Pokémon Sorted by Descending Order")+geom_text(aes(label=Freq),vjust=-0.5,color="red",size=3);p3
```

*** 

Column "Name" has 23 levels, namely Rayquaza, Articuno, Mewtwo, Moltres, Groudon, Ho-Oh, Latias, Latios, Lugia, Zapdos, Suicune, Kyogre, Entei, Regice, Raikou, Regirock, Registeel, Mew, Giratina, Deoxys, Cresselia, Heatran, and Palkia. Meltan and Melmetal are not recorded.

 

 QUESTION THREE: Which Legendary Pokémon did I spend the most effort on? 

 

The top six Legendary Pokémon are Rayquaza, Mewtwo, Groudon, Zapdos, Lugia, and Ho-Oh, with total counts of 273, 207, 166, 137, 134, and 104 respectively. Among the 23 Legendary Pokémon, **Rayquaza cost me the most effort**, and indeed it is powerful in the game. Notably, Mew has been caught only twice (once on each account) because a Mythical Pokémon like Mew can be obtained only once per account. Palkia was released only on January 29th, 2019; hence its sample size is still small. The sample sizes of Regirock, Registeel, Latias, and Giratina are also small because they are not useful in the game. Deoxys is the current EX Raid boss, so it is rare.


### **QUESTION FOUR: Do Top six Legendary Pokémon have the same distribution?**  {data-commentary-width=400}

```{r boxplot}
#Top six Legendary Pokémon
top6 <- unique(caught[caught$Freq>100,]$Var1)

p4 <- ggplot(subset(legendary,Name%in%top6),aes(Name,IV))+geom_boxplot(fill="#4271AE",color="#1F3552",size=1,notch=T)+theme_light()+labs(y="Individual Values (%)",x="Pokémon Name")+ggtitle("Fig. 4. Boxplot of IV by Top Six Legendary Pokémon");p4
```

*** 

Fig. 4 is a notched boxplot showing the minimum, first quartile, median, third quartile, and maximum IV for each of the top six Legendary Pokémon. The top six are those with a count over 100, as shown in Fig. 3.
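
As a small illustration of what a notched boxplot encodes, the sketch below applies base R's `boxplot.stats()` to a made-up IV vector (not taken from the dataset). The notch spans roughly the median ± 1.58 × IQR / √n, so notches that do not overlap suggest that two medians differ.

```r
# Hypothetical IV values for one Pokémon species (illustration only)
iv <- c(67, 73, 80, 82, 84, 89, 91, 96, 100)

s <- boxplot.stats(iv)
s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$conf   # lower and upper ends of the notch around the median

# The same notch computed by hand from the hinges
hinge_spread <- s$stats[4] - s$stats[2]
notch <- median(iv) + c(-1, 1) * 1.58 * hinge_spread / sqrt(length(iv))
notch
```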

 

 QUESTION FOUR: Do Top six Legendary Pokémon have the same distribution? 

 

In this sample, Mewtwo and Rayquaza reach a perfect IV (100%). Ho-Oh, Lugia, Mewtwo, and Rayquaza have similar medians of around 82%. Groudon and Zapdos have the same median of 84%. Notably, **Ho-Oh and Lugia show the same performance**; fittingly, they are supposed to be a pair in the game.


### **QUESTION FIVE: Which Legendary Pokémon should I improve on catching in the future?** {data-commentary-width=400}

```{r balloon}
#aggregate the data to calculate the mean and the count
data <- aggregate(IV~Name+Owner,data=legendary,mean)
data$Freq <- aggregate(IV~Name+Owner,data=legendary,length)[,3]

#fill (color) encodes the mean IV and size encodes the count, so the title says so
p5 <- ggballoonplot(data,x="Name",y="Owner",size="Freq",fill="IV",ggtheme=theme_bw())+scale_fill_viridis_c(option="C")+ggtitle("Fig. 5. Legendary Pokémon's IV in Color and Count in Size");p5
```

*** 

Fig. 5 shows the grouped frequency table as a balloon plot: the warmer the color, the higher the mean IV; the larger the circle, the higher the count.

 

The two accounts caught similar numbers of each Legendary Pokémon, except that Account "WZX" has not joined many recent Palkia raids.

 

 QUESTION FIVE: Which Legendary Pokémon should I improve on catching in the future? 

 

**Suggestions for next steps**:  
1) catch more Palkia on Account "WZX";  
2) catch more Deoxys on Account "WZX" because the sample IV is low;  
3) catch more Giratina on Account "SDQ" because of low IV;  
4) catch more Latios on both accounts because of low IV;  
5) catch more Registeel on both accounts because of low IV.