Learning dplyr

I. Five things dplyr can do

We’ll look at things you might not expect dplyr to be able to do.

Keith McNulty, Five things you never knew you could do with dplyr

II. Before we start

First, check which version of the dplyr package you have installed. The functions covered here are either new in dplyr 1.0.0 or substantially improved since then.

This is how you check the version.

library("dplyr")
packageVersion("dplyr")

## [1] '1.0.5'

1.0.5 was the latest version at the time of writing.
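If you want a script to stop early on an older installation, the check can be automated. This is a minimal sketch using only base R; the 1.0.0 threshold comes from the note above, and the variable names `installed` and `required` are only illustrative.

```r
# compare the installed dplyr version against the minimum assumed here
installed <- packageVersion("dplyr")
required <- package_version("1.0.0")

if (installed < required) {
  stop(
    "dplyr ", as.character(required), " or later is needed; found ",
    as.character(installed)
  )
}
```

Version objects compare component by component, so "1.0.10" is correctly treated as newer than "1.0.5", which a plain string comparison would get wrong.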

1. rowwise()

Before starting on the five things, we need to know how to use rowwise().

Let’s take a look at some examples.

Load the sample data.

# for getting the data via URL
library(readr)
# load the data
df <- read_csv("https://pastebin.com/raw/nWkAe1qR")
head(df)

## # A tibble: 6 x 7
##      id english japanese nationality department  year gender
##   <dbl>   <dbl>    <dbl> <chr>       <chr>      <dbl> <chr> 
## 1     1    17.8     75.6 japanese    literature     1 male  
## 2     2    64.4     53.3 nepal       literature     1 male  
## 3     3    86.7     31.1 nepal       literature     1 male  
## 4     4    60       62.2 indonesia   literature     1 male  
## 5     5    42.2     80   japanese    literature     1 male  
## 6     6    33.3     75.6 japanese    literature     1 male

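Suppose you want the higher of each student's English and Japanese marks. You might naturally try the following, but it does not give what you intend: max() is evaluated over both columns as a whole, so every row receives the same overall maximum (93.3).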

df %>% 
  dplyr::mutate(max = max(english, japanese)) %>% 
  head()

## # A tibble: 6 x 8
##      id english japanese nationality department  year gender   max
##   <dbl>   <dbl>    <dbl> <chr>       <chr>      <dbl> <chr>  <dbl>
## 1     1    17.8     75.6 japanese    literature     1 male    93.3
## 2     2    64.4     53.3 nepal       literature     1 male    93.3
## 3     3    86.7     31.1 nepal       literature     1 male    93.3
## 4     4    60       62.2 indonesia   literature     1 male    93.3
## 5     5    42.2     80   japanese    literature     1 male    93.3
## 6     6    33.3     75.6 japanese    literature     1 male    93.3

Instead, add rowwise() so that mutate() is evaluated one row at a time.

df %>% 
  rowwise() %>% 
  mutate(max = max(english, japanese)) %>% 
  head()

## # A tibble: 6 x 8
## # Rowwise: 
##      id english japanese nationality department  year gender   max
##   <dbl>   <dbl>    <dbl> <chr>       <chr>      <dbl> <chr>  <dbl>
## 1     1    17.8     75.6 japanese    literature     1 male    75.6
## 2     2    64.4     53.3 nepal       literature     1 male    64.4
## 3     3    86.7     31.1 nepal       literature     1 male    86.7
## 4     4    60       62.2 indonesia   literature     1 male    62.2
## 5     5    42.2     80   japanese    literature     1 male    80  
## 6     6    33.3     75.6 japanese    literature     1 male    75.6
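For this particular task, base R's vectorised pmax() gives the same per-row result without rowwise(). The following is a minimal sketch on a small made-up tibble (`scores` is hypothetical, not the sample data, and the column names `overall_max` and `row_max` are only illustrative).

```r
library(dplyr)

# hypothetical toy scores, not the sample data above
scores <- tibble(
  english  = c(17.8, 64.4, 86.7),
  japanese = c(75.6, 53.3, 31.1)
)

result <- scores %>%
  mutate(
    overall_max = max(english, japanese),  # one value repeated on every row
    row_max     = pmax(english, japanese)  # element-wise maximum, one per row
  )

result
```

rowwise() remains the more general tool, because it also works with functions that have no vectorised counterpart.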

2. nest_by()

You should also know how to use nest_by().

Let’s take a look at an example with the above data.

nest_by() is a shortcut that groups the data, packs each group into a list-column named data, and returns a row-wise data frame. Here you get five rows, one per nationality, each holding that group's rows as a nested tibble.

df %>% 
  nest_by(nationality)

## # A tibble: 5 x 2
## # Rowwise:  nationality
##   nationality               data
##   <chr>       <list<tbl_df[,6]>>
## 1 china                 [12 × 6]
## 2 indonesia              [3 × 6]
## 3 japanese             [135 × 6]
## 4 nepal                 [26 × 6]
## 5 vietnam               [24 × 6]
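Each element of the data column is an ordinary tibble, so it can be pulled out on its own or used in further row-wise computations. A minimal sketch, assuming a hypothetical toy tibble (`toy`) in place of the sample data:

```r
library(dplyr)

# hypothetical toy data standing in for the sample data
toy <- tibble(
  nationality = c("japanese", "japanese", "nepal"),
  english     = c(17.8, 42.2, 64.4)
)

nested <- toy %>% nest_by(nationality)

# each element of the list-column is a tibble holding one group's rows
nested$data[[1]]

# the result is row-wise, so `data` refers to one group's tibble at a time
group_stats <- nested %>%
  mutate(n = nrow(data), mean_english = mean(data$english))

group_stats
```

This is the property the later sections rely on: inside mutate(), data stands for a single group's tibble rather than the whole list-column.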

III. Run many different models to test for fit

You can get tidy model outputs with the broom package.

library(broom)
broom::glance(
  lm(Volume ~ Girth + Height, trees)
)

## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.948         0.944  3.88      255. 1.07e-18     2  -84.5  177.  183.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Combining a row-wise mutate() with broom::glance() returns the fit statistics for every model at once. The example uses trees, a data set built into R.

# create a column with model formulas to test
models <- data.frame(
  formula = c(
    "Volume ~ Girth",
    "Volume ~ Girth + Height"
  )
)
# run them all and get fit statistics
models %>%
  dplyr::rowwise() %>%
  dplyr::mutate(
    broom::glance(lm(formula, trees))
  )

## # A tibble: 2 x 13
## # Rowwise: 
##   formula r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC
##   <chr>       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl>
## 1 Volume…     0.935         0.933  4.25      419. 8.64e-19     1  -87.8  182.
## 2 Volume…     0.948         0.944  3.88      255. 1.07e-18     2  -84.5  177.
## # … with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   nobs <int>
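Because the result is still a data frame, the candidate models can be ranked directly. The following is a minimal sketch that drops the row-wise grouping and sorts by AIC; the name `ranked` is illustrative, and wrapping the string in as.formula() is an optional choice made here to keep the conversion explicit.

```r
library(dplyr)
library(broom)

# the same candidate formulas as above
models <- data.frame(
  formula = c("Volume ~ Girth", "Volume ~ Girth + Height")
)

ranked <- models %>%
  rowwise() %>%
  mutate(glance(lm(as.formula(formula), data = trees))) %>%
  ungroup() %>%                            # drop the row-wise grouping
  select(formula, adj.r.squared, AIC) %>%  # keep a few statistics for comparison
  arrange(AIC)                             # lower AIC first

ranked
```

Without ungroup(), later verbs would keep operating one row at a time, which is rarely what you want once the models have been fitted.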

IV. Create a batch of charts

Returning to the English and Japanese exam data, you can create one chart per nationality with nest_by() and mutate().

library(ggplot2)
scatters <- df %>%
  dplyr::nest_by(nationality) %>%
  dplyr::mutate(
    charts = list(
      ggplot(data, aes(x = english, y = japanese)) +
        geom_point()
    )
  )
print(scatters)

## # A tibble: 5 x 3
## # Rowwise:  nationality
##   nationality               data charts
##   <chr>       <list<tbl_df[,6]>> <list>
## 1 china                 [12 × 6] <gg>  
## 2 indonesia              [3 × 6] <gg>  
## 3 japanese             [135 × 6] <gg>  
## 4 nepal                 [26 × 6] <gg>  
## 5 vietnam               [24 × 6] <gg>

Display the chart for the students from China.

scatters$charts[[1]]

Display the chart for the students from Indonesia.

scatters$charts[[2]]
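The stored charts can also be saved to disk in one pass. This is a minimal sketch, assuming a hypothetical toy tibble (`toy`) in place of the sample data, a temporary output directory (`out_dir`), and PDF output with made-up file names; none of these come from the original example.

```r
library(dplyr)
library(ggplot2)

# hypothetical toy data standing in for the sample data
toy <- tibble(
  nationality = c("japanese", "japanese", "nepal", "nepal"),
  english     = c(17.8, 42.2, 64.4, 86.7),
  japanese    = c(75.6, 80.0, 53.3, 31.1)
)

# one scatter plot per nationality, stored in a list-column
scatters <- toy %>%
  nest_by(nationality) %>%
  mutate(
    charts = list(
      ggplot(data, aes(x = english, y = japanese)) +
        geom_point()
    )
  )

# save each chart as a PDF in a temporary directory (an assumed location)
out_dir <- tempdir()
for (i in seq_len(nrow(scatters))) {
  ggsave(
    filename = file.path(
      out_dir, paste0("scatter_", scatters$nationality[i], ".pdf")
    ),
    plot = scatters$charts[[i]],
    width = 5, height = 4
  )
}
```

A plain for loop is used here because saving a file is a side effect, so there is nothing useful to store in a new column.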

V. Write a batch of CSV files

Write the data for each nationality to a separate file.

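# write one CSV file per nationality into the working directory
df %>%
  dplyr::nest_by(nationality) %>%
  dplyr::mutate(
    # called for its side effect; write.csv() returns NULL, so no column is added
    write.csv(data, paste0("nationality", nationality, ".csv"))
  )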

## # A tibble: 5 x 2
## # Rowwise:  nationality
##   nationality               data
##   <chr>       <list<tbl_df[,6]>>
## 1 china                 [12 × 6]
## 2 indonesia              [3 × 6]
## 3 japanese             [135 × 6]
## 4 nepal                 [26 × 6]
## 5 vietnam               [24 × 6]
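A variation on this pattern records where each file went and confirms that it was written. It is a minimal sketch, assuming a hypothetical toy tibble (`toy`), a temporary output directory (`out_dir`), and the illustrative column names `path` and `written`; it is not the approach used above, which relies only on the side effect.

```r
library(dplyr)

# hypothetical toy data standing in for the sample data
toy <- tibble(
  nationality = c("japanese", "japanese", "nepal"),
  english     = c(17.8, 42.2, 64.4)
)

# write into a temporary folder instead of the working directory
out_dir <- file.path(tempdir(), "by_nationality")
dir.create(out_dir, showWarnings = FALSE)

written_files <- toy %>%
  nest_by(nationality) %>%
  mutate(
    # build the target path for this row's group
    path = file.path(out_dir, paste0("nationality_", nationality, ".csv")),
    # write the file, then record whether it now exists
    written = {
      write.csv(data, path, row.names = FALSE)
      file.exists(path)
    }
  ) %>%
  ungroup()

written_files

# read one file back to check its contents
read.csv(file.path(out_dir, "nationality_nepal.csv"))
```

Keeping the path in a column makes it easy to read the files back later or to spot a group that failed to write.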

To be continued.