In this lab activity, we’ll discuss Data Wrangling/Transformation via the dplyr package. We’ll explore ways to choose subsets of data, aggregate data to create summaries, make new variables, and sort your data frames. It is recommended you also explore the RStudio Cheatsheet on Data Transformation as we discuss this content.

Back to gapminder

Here is a look at the gapminder data frame in the gapminder package.

library(gapminder)
rmarkdown::paged_table(gapminder)
# Has been set as the default way to print data frames
# via df_print: paged in the YAML at the top

Say we wanted mean life expectancy across all years for Asia

# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
[1] 60.0649
library(dplyr)
gapminder %>% 
  filter(continent == "Asia") %>%
  summarize(mean_exp = mean(lifeExp))

filter()

library(gapminder)
library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
gap_2007

Logical operators

gapminder %>% 
  filter(year == 2002 | continent == "Europe")
gapminder %>% 
  filter(year == 2002, continent == "Europe")
gapminder %>% 
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))

summarize()

max_exp_1997 <- gapminder %>% 
  filter(year == 1997) %>% 
  summarize(max_exp = max(lifeExp))
max_exp_1997

Combining summarize() with group_by()

When you’d like to determine a numerical summary for all levels of a different categorical variable

max_exp_1997_by_cont <- gapminder %>% 
  filter(year == 1997) %>% 
  group_by(continent) %>%
  summarize(max_exp = max(lifeExp))
max_exp_1997_by_cont

Without the %>%

It’s hard to appreciate the %>% without seeing what the code would look like without it:

max_exp_1997_by_cont <- 
  summarize(
    group_by(
      filter(
        gapminder, 
          year == 1997), 
      continent),
    max_exp = max(lifeExp))
max_exp_1997_by_cont

mutate()

  1. create a new variable based on other variables
gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
gap_w_gdp

mutate()

  1. change the contents of an existing variable
gap_weird <- gapminder %>% mutate(pop = pop + 1000)
gap_weird

arrange()

gapminder %>%
  arrange(year, country)
gapminder %>%
  filter(year > 2000) %>%
  arrange(desc(lifeExp))

Other useful dplyr verbs

Determine which African country had the highest GDP per capita in 1982 using the gapminder data in the gapminder package.

library(gapminder)
library(dplyr)
library(tidyr)

highest_gdp <- gapminder %>%
   filter(continent == "Africa") %>% #filters for Africa as the continent
  arrange(desc(gdpPercap)) %>% #arranges the filtered data in descending order
  slice_head(n = 5) #give the top 5

print(highest_gdp)
# A tibble: 5 × 6
  country continent  year lifeExp     pop gdpPercap
  <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
1 Libya   Africa     1977    57.4 2721783    21951.
2 Gabon   Africa     1977    52.8  706367    21746.
3 Libya   Africa     1972    52.8 2183877    21011.
4 Libya   Africa     1967    50.2 1759224    18773.
5 Libya   Africa     1982    62.2 3344074    17364.

#The country with the highest GDP per capita in Africa is Libya.

Challenge

For both of these problems below, use the bechdel data frame in the fivethirtyeight package:

library(fivethirtyeight)
library(dplyr)
library(tidyr)

movies_count <- fivethirtyeight::bechdel %>% #pulls the bechdel data frame from
  #the fivethirtyeight package and names a new variable movies_count with 
  #the following parameters
   filter(year == '2013') %>% #filters for the year 2013
  count(clean_test) #from 2013 data, counts the number of values in clean_test

print(movies_count) #prints that count
# A tibble: 5 × 2
  clean_test     n
  <ord>      <int>
1 nowomen        6
2 notalk        23
3 men           15
4 dubious        9
5 ok            46

library(fivethirtyeight)
library(dplyr)
library(tidyr)

ok_movies <- fivethirtyeight::bechdel %>% 
   filter(clean_test == 'ok') #filters data that received 'ok' under clean_test
  #creates a new variable called ok_movies
count(ok_movies) #does a count of values in ok_movies
ok_percentage <- (count(ok_movies) / 1794) * 100 #creates a new variable that 
  #calculates the percentage 


print(ok_percentage) #prints percentage
         n
1 44.76031

Your Task

Determine the top five movies in terms of domestic return on investment for 2013 scaled data using the bechdel data frame in the fivethirtyeight package.

fivethirtyeight::bechdel %>%
  filter(domgross_2013 > 0) %>% #filters for the domgross_2013 column 
  arrange(desc(domgross_2013)) %>% #arranges that domgross_2013 column 
  #descending order
  slice_head(n = 5) #pulls out the top 5 dom grossing movies
---
title: 'Data Wrangling Lab2: Intro to `dplyr`'
author: "Jacinda Evans"
output:
  html_document:
    code_download: yes
    code_folding: hide
    df_print: paged
  pdf_document: default
---

```{r include=FALSE}
library(tidyverse)
knitr::opts_chunk$set(message=FALSE)
filter <- dplyr::filter
knitr::opts_chunk$set(warning=FALSE, message=FALSE, fig.width=9.5, fig.height=4.5, comment=NA, rows.print=16, out.width = "\\textwidth")
theme_set(theme_gray(base_size = 20))
```

In this lab activity, we'll discuss Data Wrangling/Transformation via the `dplyr` package. We'll explore ways to choose subsets of data, aggregate data to create summaries, make new variables, and sort your data frames. It is recommended you also explore the RStudio Cheatsheet on [Data Transformation](https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf) as we discuss this content.

### Back to `gapminder`

Here is a look at the `gapminder` data frame in the `gapminder` package.

```{r}
library(gapminder)
rmarkdown::paged_table(gapminder)
# Has been set as the default way to print data frames
# via df_print: paged in the YAML at the top
```

Say we wanted mean life expectancy across all years for Asia

```{r}
# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
```

```{r}
library(dplyr)
gapminder %>% 
  filter(continent == "Asia") %>%
  summarize(mean_exp = mean(lifeExp))
```


------------------------------------------------------------------------

## `filter()`

-   Select a subset of the rows of a data frame.
-   The arguments are the "filters" that you'd like to apply.

```{r}
library(gapminder)
library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
gap_2007
```

-   Use `==` to compare a variable to a value

## Logical operators

-   Use `|` to check for any in multiple filters being true:

```{r}
gapminder %>% 
  filter(year == 2002 | continent == "Europe")
```

-   Use `&` or `,` to check for all of multiple filters being true:

```{r}
gapminder %>% 
  filter(year == 2002, continent == "Europe")
```

-   Use `%in%` to check for any being true (shortcut to using `|` repeatedly with `==`)

```{r}
gapminder %>% 
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))
```

## `summarize()`

-   Any numerical summary that you want to apply to a column of a data frame is specified within `summarize()`.

```{r}
max_exp_1997 <- gapminder %>% 
  filter(year == 1997) %>% 
  summarize(max_exp = max(lifeExp))
max_exp_1997
```

### Combining `summarize()` with `group_by()`

When you'd like to determine a numerical summary for all levels of a different categorical variable

```{r}
max_exp_1997_by_cont <- gapminder %>% 
  filter(year == 1997) %>% 
  group_by(continent) %>%
  summarize(max_exp = max(lifeExp))
max_exp_1997_by_cont
```

### Without the `%>%`

It's hard to appreciate the `%>%` without seeing what the code would look like without it:

```{r}
max_exp_1997_by_cont <- 
  summarize(
    group_by(
      filter(
        gapminder, 
          year == 1997), 
      continent),
    max_exp = max(lifeExp))
max_exp_1997_by_cont
```

## `mutate()`

-   Allows you to
    1.  create a new variable based on other variables OR
    2.  change the contents of an existing variable

1.  create a new variable based on other variables

```{r}
gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
gap_w_gdp
```

## `mutate()`

3.  change the contents of an existing variable

```{r}
gap_weird <- gapminder %>% mutate(pop = pop + 1000)
gap_weird
```

## `arrange()`

-   Reorders the rows in a data frame based on the values of one or more variables

```{r}
gapminder %>%
  arrange(year, country)
```

-   Can also put into descending order

```{r desc}
gapminder %>%
  filter(year > 2000) %>%
  arrange(desc(lifeExp))
```

## Other useful `dplyr` verbs

-   `select`
-   `top_n`
-   `sample_n`
-   `slice`
-   `glimpse`
-   `rename`
---------------------------------------------------------------------------------
## Your Task

Determine which African country had the highest GDP per capita in 1982 using the `gapminder` data in the `gapminder` package.

```{r}
library(gapminder)
library(dplyr)
library(tidyr)

highest_gdp <- gapminder %>%
   filter(continent == "Africa") %>% #filters for Africa as the continent
  arrange(desc(gdpPercap)) %>% #arranges the filtered data in descending order
  slice_head(n = 5) #give the top 5

print(highest_gdp)
```
#The country with the highest GDP per capita in Africa is Libya. 


**Challenge**

For both of these problems below, use the `bechdel` data frame in the `fivethirtyeight` package:

-   Use the `count` function in the `dplyr` package to determine how many movies in 2013 fell into each of the different categories for `clean_test`
-   Determine the percentage of movies that received the value of `"ok"` across all years



```{r}
library(fivethirtyeight)
library(dplyr)
library(tidyr)

movies_count <- fivethirtyeight::bechdel %>% #pulls the bechdel data frame from
  #the fivethirtyeight package and names a new variable movies_count with 
  #the following parameters
   filter(year == '2013') %>% #filters for the year 2013
  count(clean_test) #from 2013 data, counts the number of values in clean_test

print(movies_count) #prints that count
```

------------------------------------------------------------------------



```{r}
library(fivethirtyeight)
library(dplyr)
library(tidyr)

ok_movies <- fivethirtyeight::bechdel %>% 
   filter(clean_test == 'ok') #filters data that received 'ok' under clean_test
  #creates a new variable called ok_movies
count(ok_movies) #does a count of values in ok_movies


ok_percentage <- (count(ok_movies) / 1794) * 100 #creates a new variable that 
  #calculates the percentage 


print(ok_percentage) #prints percentage
```


## Your Task

Determine the top five movies in terms of domestic return on investment for 2013 scaled data using the `bechdel` data frame in the `fivethirtyeight` package.

```{r}
fivethirtyeight::bechdel %>%
  filter(domgross_2013 > 0) %>% #filters for the domgross_2013 column 
  arrange(desc(domgross_2013)) %>% #arranges that domgross_2013 column 
  #descending order
  slice_head(n = 5) #pulls out the top 5 dom grossing movies
```
