Merging and Analysing Data

Recitation 4

Rithika Kumar (joint with Alex Toolkin)

September 26, 2019

PS1 is out!

Review your problem sets. As Professor Hopkins noted, most people get 2s (checks). Everyone received feedback on the problem set; read it, because the feedback is the important part.

Goals for Today

  1. DPLYR
  2. Merging
  3. Data Exploration

DPLYR

The wonders of dplyr

DPLYR versus Base R

How do we get all the names of first generation pokemon?
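One way to answer this, sketched on a toy stand-in for the pokedex data. The `name` column and the four rows below are assumptions made for illustration; only `generation` is taken from the slides.

```r
library(dplyr)

# Toy stand-in for pokedex; the `name` column is an assumed column name
pokedex <- data.frame(
  name = c("Bulbasaur", "Charmander", "Chikorita", "Treecko"),
  generation = c(1, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Base R: logical indexing on the generation column
base_names <- pokedex$name[pokedex$generation == 1]

# dplyr: keep the matching rows, then pull the column out as a vector
dplyr_names <- pokedex %>%
  filter(generation == 1) %>%
  pull(name)

# Both approaches give the same character vector
identical(base_names, dplyr_names)
```

The dplyr version reads top to bottom as a sequence of steps, which tends to pay off once more transformations are chained on.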

Remember, piping on its own does not modify your objects or create new ones! Assign the result with <- if you want to keep it.
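A minimal sketch of that point, using a two-row toy data frame (the rows and the `name` column are made up for illustration):

```r
library(dplyr)

# Toy data; values are illustrative only
pokedex <- data.frame(
  name = c("Bulbasaur", "Chikorita"),
  generation = c(1, 2)
)

# The filtered result is printed, but pokedex itself is left untouched
pokedex %>% filter(generation == 1)
rows_after_pipe <- nrow(pokedex)  # still 2

# To keep the result, assign it to a name
first_gen <- pokedex %>% filter(generation == 1)
rows_first_gen <- nrow(first_gen)  # 1
```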

Dplyr is sweet

https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

pokedex %>% 
  mutate(
    sp_total = sp_attack + sp_defense
  )

And you can keep stacking transformations!

Example: What is the average sp_attack + sp_defense of each generation of pokemon?

pokedex %>%
  mutate(sp_total = sp_attack + sp_defense) %>%
  group_by(generation) %>%
  summarise(mean_sp_total = mean(sp_total))  # name the summary column
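To check what the stacked pipeline computes, here is the same chain on a three-row toy table where the answer can be worked out by hand. The values are invented; the column names follow the slides.

```r
library(dplyr)

# Toy data: two generation-1 rows and one generation-2 row
pokedex <- data.frame(
  generation = c(1, 1, 2),
  sp_attack  = c(10, 30, 50),
  sp_defense = c(20, 40, 50)
)

gen_means <- pokedex %>%
  mutate(sp_total = sp_attack + sp_defense) %>%   # 30, 70, 100
  group_by(generation) %>%
  summarise(mean_sp_total = mean(sp_total))       # one row per generation

gen_means
```

Generation 1 averages (30 + 70) / 2 = 50 and generation 2 has the single value 100, so the result has two rows.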

Merging

Types of Merges

Joins in code

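library(dplyr)  # provides %>% and the *_join() functions

lfp1 <- read.csv("data/lfp1.csv")
lfp2 <- read.csv("data/lfp2.csv")
# lfp2 stores its identifier as id.no; copy it to id so both tables share a key
lfp2$id <- lfp2$id.no

# inner join: keep only rows whose id appears in both tables
merged_lfp <- lfp1 %>% inner_join(lfp2, by = "id")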
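As an alternative to copying `id.no` into a new `id` column, dplyr joins accept a named vector in `by` to match keys with different names. A sketch on toy tables (the data values and the `wage` and `k5` contents are invented for illustration):

```r
library(dplyr)

# Toy stand-ins for lfp1 and lfp2; values are made up
lfp1 <- data.frame(id = c(1, 2, 3), wage = c(10, 12, 9))
lfp2 <- data.frame(id.no = c(2, 3, 4), k5 = c(0, 1, 2))

# Match lfp1$id against lfp2$id.no without creating an extra column
merged_toy <- lfp1 %>% inner_join(lfp2, by = c("id" = "id.no"))

merged_toy  # rows for ids 2 and 3; the key keeps the name from lfp1
```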

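# examples of various joins (each line overwrites merged_lfp)
# full join: keep every row from both tables
merged_lfp <- lfp1 %>% full_join(lfp2, by = "id")
# base R merge(): an inner join by default
merged_lfp <- merge(lfp1, lfp2, by = "id")
# right join: keep every row of lfp2
merged_lfp <- right_join(lfp1, lfp2, by = "id")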
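To see how the join types differ, it can help to count rows on two small tables whose ids only partly overlap. The tables below are hypothetical and exist only for this comparison.

```r
library(dplyr)

# Hypothetical tables: ids 2 and 3 appear in both
left_tbl  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right_tbl <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

n_inner <- nrow(inner_join(left_tbl, right_tbl, by = "id"))  # ids 2, 3
n_left  <- nrow(left_join(left_tbl, right_tbl, by = "id"))   # ids 1, 2, 3
n_right <- nrow(right_join(left_tbl, right_tbl, by = "id"))  # ids 2, 3, 4
n_full  <- nrow(full_join(left_tbl, right_tbl, by = "id"))   # ids 1, 2, 3, 4

c(inner = n_inner, left = n_left, right = n_right, full = n_full)
```

Rows without a match on the other side are filled with NA in the left, right, and full joins, which is one common way NAs enter a merged data set.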

Data Exploration

Summarizing Data

Using summary statistics on our data

Say we’re interested in the attack stat of first-generation pokemon

firstg_attack <- pokedex %>% 
  filter(generation == 1) %>% 
  pull(attack)

mean(firstg_attack)
median(firstg_attack)
var(firstg_attack)
sd(firstg_attack)
quantile(firstg_attack)

# To get the deciles (the argument is probs; spell it out rather than rely on partial matching)
quantile(pokedex$speed, probs = seq(0, 1, by = 0.1))

# This can be modified for other cuts as well.
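For instance, other cuts only need a different `probs` vector. A sketch on a made-up vector of the integers 0 to 100, chosen so the cut points are easy to verify by eye:

```r
# Made-up data: the integers 0 through 100
x <- 0:100

# Quartile cut points
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Quintiles: six cut points from 0% to 100%
quintiles <- quantile(x, probs = seq(0, 1, by = 0.2))

quartiles
quintiles
```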

DANGER - NAs!

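Note that summary statistics break if you have NAs: mean(), median(), var(), and sd() return NA, and quantile() throws an error, unless you pass na.rm = TRUE.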

mean(merged_lfp$k5)

uh oh!

sum(is.na(merged_lfp$k5))                  # how many values are missing?
lfp.merged.nona <- na.omit(merged_lfp$k5)  # drop the missing values
mean(lfp.merged.nona)
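A shorter route to the same result is the `na.rm` argument that most summary functions accept. A sketch on a made-up vector standing in for the `k5` column:

```r
# Made-up stand-in for merged_lfp$k5, with one missing value
k5 <- c(0, 1, NA, 2)

n_missing <- sum(is.na(k5))          # count the NAs
mean_raw  <- mean(k5)                # NA, because one value is missing
mean_rm   <- mean(k5, na.rm = TRUE)  # drops the NA before averaging

# na.omit() followed by mean() gives the same number
mean_omit <- mean(na.omit(k5))
```

Counting the NAs first is still worthwhile: dropping them silently can hide how much of the data is missing.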

Exercise

Explore the Pokemon data set - some sample questions to try.