PS1 is out!

Review your problem sets. As Professor Hopkins noted, most people get 2s (checks). Everyone got feedback on the problem set, so read it; the feedback is the important part.

Goals for Today

DPLYR
Merging
Data Exploration

The wonders of dplyr

Install dplyr with install.packages("dplyr")
Load dplyr with library(dplyr)
Load the pokemon data set

DPLYR versus Base R

How do we get all the names of first generation pokemon?

Base R: pokedex[pokedex$generation == 1,]$name
dplyr: pokedex %>% filter(generation == 1) %>% select(name)
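
To make the comparison concrete, here is a minimal sketch on a tiny made-up data frame (toy_pokedex and its rows are hypothetical stand-ins for the real pokemon data). One difference worth noticing: select() returns a one-column data frame, while pull() returns a plain vector like the base R version.

```r
library(dplyr)

# Hypothetical toy data standing in for the real pokedex
toy_pokedex <- data.frame(
  name = c("Bulbasaur", "Charmander", "Chikorita"),
  generation = c(1, 1, 2),
  stringsAsFactors = FALSE
)

# Base R: subset the rows, then extract the name column as a vector
base_names <- toy_pokedex[toy_pokedex$generation == 1, ]$name

# dplyr: select() keeps the result as a one-column data frame
dplyr_df <- toy_pokedex %>% filter(generation == 1) %>% select(name)

# dplyr: pull() extracts the column as a vector, matching base R
dplyr_names <- toy_pokedex %>% filter(generation == 1) %>% pull(name)
```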

Remember, piping does not modify your objects or create new ones! Assign the result with <- if you want to keep it.
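
A quick way to convince yourself of this, sketched on a hypothetical three-row data frame (toy and first_gen are made-up names):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(name = c("a", "b", "c"), generation = c(1, 1, 2))

# The pipeline returns a filtered copy; toy itself is left untouched
toy %>% filter(generation == 1)
rows_after_pipe <- nrow(toy)

# Assign the result to a name to keep it
first_gen <- toy %>% filter(generation == 1)
rows_first_gen <- nrow(first_gen)
```

Here rows_after_pipe is still 3, while rows_first_gen is 2.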

Dplyr is sweet

The main functions are filter(), select(), mutate(), group_by(), and summarise()
Review the cheat sheets for a great dplyr guide

https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Dplyr is sweet

You can do anything you could do in base R
Create a new “total special” variable:

pokedex %>% 
  mutate(
    sp_total = sp_attack + sp_defense
  )

Dplyr is sweet

And you can keep stacking transformations!

Example: What is the average sp_attack + sp_defense of each generation of pokemon?

pokedex %>%
  mutate(sp_total = sp_attack + sp_defense) %>%
  group_by(generation) %>%
  summarise(mean_sp_total = mean(sp_total))  # name the summary column
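
It can help to run a stacked pipeline on data small enough to check by hand. The sketch below uses a hypothetical four-row data frame (the values are made up) and also counts rows per group with n():

```r
library(dplyr)

# Hypothetical toy data: two pokemon per generation
toy_pokedex <- data.frame(
  generation = c(1, 1, 2, 2),
  sp_attack  = c(60, 80, 50, 70),
  sp_defense = c(40, 60, 50, 90)
)

by_gen <- toy_pokedex %>%
  mutate(sp_total = sp_attack + sp_defense) %>%  # 100, 140, 100, 160
  group_by(generation) %>%
  summarise(mean_sp_total = mean(sp_total),      # 120 for gen 1, 130 for gen 2
            n = n())                             # rows in each group
```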

Types of Merges

Full Join: All rows and columns from both data sets
Inner Join: Only rows with matching values in both data sets
Anti Join: Only rows in the first data set with no match in the second
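
The three merges can be compared side by side on two small data frames. This is a sketch with made-up data (toy_a, toy_b, and their columns are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: ids 2 and 3 appear in both data frames
toy_a <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
toy_b <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

full  <- full_join(toy_a, toy_b, by = "id")   # ids 1, 2, 3, 4; NA where there is no match
inner <- inner_join(toy_a, toy_b, by = "id")  # ids 2 and 3 only
anti  <- anti_join(toy_a, toy_b, by = "id")   # id 1: in toy_a but not in toy_b
```

Note that anti_join() is not symmetric: swapping the two arguments would return id 4 instead.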

Joins in code

lfp1 <- read.csv("data/lfp1.csv")
lfp2 <- read.csv("data/lfp2.csv")
lfp2$id <- lfp2$id.no

merged_lfp <- lfp1 %>% inner_join(lfp2, by = "id")
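
As an alternative to copying id.no into a new id column, dplyr joins accept a named vector in by that maps the key name in the first data frame to the key name in the second. A minimal sketch, assuming toy versions of the two files (toy_lfp1, toy_lfp2, and the inc column are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: the key is called id in one file and id.no in the other
toy_lfp1 <- data.frame(id = c(1, 2, 3), inc = c(10, 20, 30))
toy_lfp2 <- data.frame(id.no = c(2, 3, 4), k5 = c(0, 1, 2))

# Match toy_lfp1$id against toy_lfp2$id.no; the result keeps the name id
merged <- toy_lfp1 %>% inner_join(toy_lfp2, by = c("id" = "id.no"))
```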

Joins in code

# examples of various joins
merged_lfp <- lfp1 %>% full_join(lfp2, by = "id")  # keep all rows from both
merged_lfp <- merge(lfp1, lfp2, by = "id")         # base R; inner join by default
merged_lfp <- right_join(lfp1, lfp2, by = "id")    # keep all rows from lfp2

Summarizing Data

Central tendency: mean(), median()
Dispersion: var(), sd()
Quantiles: quantile()
Association: cor(), cov(), table()
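
The association functions take two variables, or a categorical variable in the case of table(). A small sketch on made-up vectors (x, y, and type are hypothetical):

```r
# Hypothetical toy vectors: y is exactly twice x
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

r  <- cor(x, y)  # correlation; 1 for a perfect positive linear relationship
cv <- cov(x, y)  # covariance; here equal to 2 * var(x)

# table() counts how often each value occurs
type   <- c("fire", "water", "fire")
counts <- table(type)
```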

Using summary statistics on our data

Say we’re interested in 1st gen pokemon attack

firstg_attack <- pokedex %>% 
  filter(generation == 1) %>% 
  pull(attack)

mean(firstg_attack)
median(firstg_attack)
var(firstg_attack)
sd(firstg_attack)
quantile(firstg_attack)

# To get the deciles
quantile(pokedex$speed, probs = seq(0, 1, length.out = 11))

# This can be modified for other cuts as well.
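
For instance, changing the number of cut points gives other splits. A minimal sketch on a made-up vector (speed here is hypothetical data, not the pokemon column):

```r
# Hypothetical toy data
speed <- c(10, 20, 30, 40, 50, 60, 70)

# Terciles: 4 cut points split the data into 3 equal parts
terciles <- quantile(speed, probs = seq(0, 1, length.out = 4))

# Quintiles: steps of 0.2 split the data into 5 equal parts
quintiles <- quantile(speed, probs = seq(0, 1, by = 0.2))
```

The first and last cut points are always the minimum and maximum of the data.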

DANGER - NAs!

Note that summary statistics don’t work by default if you have NAs! mean(), for example, returns NA.

mean(merged_lfp$k5)

uh oh!

sum(is.na(merged_lfp$k5))
lfp.merged.nona <- na.omit(merged_lfp$k5)
mean(lfp.merged.nona)
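
Many summary functions, including mean(), median(), var(), and sd(), also take an na.rm argument that drops NAs on the fly. A small sketch on a made-up vector (k5 here is hypothetical data, not the merged column):

```r
# Hypothetical toy data with one missing value
k5 <- c(0, 1, NA, 2, 1)

naive_mean <- mean(k5)                # NA, because one value is missing
n_missing  <- sum(is.na(k5))          # how many values are missing
clean_mean <- mean(k5, na.rm = TRUE)  # drop NAs inside the call
omit_mean  <- mean(na.omit(k5))       # same result via na.omit()
```

Both approaches give the same answer; na.rm = TRUE avoids creating an extra object.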

Exercise

Find out the IDs that don’t overlap between lfp1 and lfp2. Hint: Merge lfp1 and lfp2 using anti_join().
Use inner_join() to merge lfp1 and lfp2. Now, find the terciles of income in the merged dataset. Hint: Use inner_join() and quantile().

Explore the Pokemon data set - some sample questions to try.

Which classification of pokemon has the highest average defense?
Which classification of pokemon has the most variance of defense?
Which classification of pokemon has the most heights missing?
Find out the mean of the column height_m. Remember to omit NAs.
If you were to merge in new data, which ID would you use for the merge?
What type of merge would make sense?

Merging and Analysing Data

Recitation 4
