We’ll look at some things you might not expect you can do with dplyr.
Keith McNulty, Five things you never knew you could do with dplyr
First, check which version of dplyr you have installed, because the functions covered here are either new in dplyr 1.0.0+ or substantially improved in it.
This is how you confirm the version.
library("dplyr")
packageVersion("dplyr")
## [1] '1.0.5'
At the time of writing, 1.0.5 is the latest version.
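If you would rather have a script fail early than discover an old dplyr halfway through, the version check can be automated. The following is a minimal sketch; the variable name `installed` and the wording of the error message are only illustrative.

```r
# Compare the installed dplyr version with the minimum this tutorial needs.
installed <- packageVersion("dplyr")

# package_version objects can be compared directly with a version string.
if (installed < "1.0.0") {
  stop(
    "dplyr ", as.character(installed),
    " is older than 1.0.0; update it with install.packages(\"dplyr\")."
  )
}
```

On an up-to-date installation the block runs silently.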
Before we begin with the five things, we should know how to use rowwise().
Let’s take a look at some examples.
Load the sample data.
# for getting the data via URL
library(readr)
# load the data
df <- read_csv("https://pastebin.com/raw/nWkAe1qR")
head(df)
## # A tibble: 6 x 7
## id english japanese nationality department year gender
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 1 17.8 75.6 japanese literature 1 male
## 2 2 64.4 53.3 nepal literature 1 male
## 3 3 86.7 31.1 nepal literature 1 male
## 4 4 60 62.2 indonesia literature 1 male
## 5 5 42.2 80 japanese literature 1 male
## 6 6 33.3 75.6 japanese literature 1 male
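Suppose you want the higher of each student's English and Japanese marks. You might naturally try the following, but it does not give what you intend: max() is applied to both columns as a whole, so every row receives the overall maximum, 93.3.
df %>%
  dplyr::mutate(max = max(english, japanese)) %>%
  head()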
## # A tibble: 6 x 8
## id english japanese nationality department year gender max
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 17.8 75.6 japanese literature 1 male 93.3
## 2 2 64.4 53.3 nepal literature 1 male 93.3
## 3 3 86.7 31.1 nepal literature 1 male 93.3
## 4 4 60 62.2 indonesia literature 1 male 93.3
## 5 5 42.2 80 japanese literature 1 male 93.3
## 6 6 33.3 75.6 japanese literature 1 male 93.3
Instead, call rowwise() first so that mutate() evaluates max() one row at a time.
df %>%
  rowwise() %>%
  mutate(max = max(english, japanese)) %>%
  head()
## # A tibble: 6 x 8
## # Rowwise:
## id english japanese nationality department year gender max
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 17.8 75.6 japanese literature 1 male 75.6
## 2 2 64.4 53.3 nepal literature 1 male 64.4
## 3 3 86.7 31.1 nepal literature 1 male 86.7
## 4 4 60 62.2 indonesia literature 1 male 62.2
## 5 5 42.2 80 japanese literature 1 male 80
## 6 6 33.3 75.6 japanese literature 1 male 75.6
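The difference between the two calls is easier to see on a tiny table. The sketch below uses a hypothetical three-row tibble called `scores` (not the exam file) and also shows base R's vectorised `pmax()`, which is an alternative for this particular case and is not part of the original walkthrough.

```r
library(dplyr)

# Hypothetical three-row sample standing in for the exam data.
scores <- tibble(
  english  = c(17.8, 64.4, 86.7),
  japanese = c(75.6, 53.3, 31.1)
)

# Without rowwise(): max() sees both whole columns, so every row gets 86.7.
whole <- scores %>%
  mutate(max = max(english, japanese))

# With rowwise(): max() is evaluated once per row.
by_row <- scores %>%
  rowwise() %>%
  mutate(max = max(english, japanese)) %>%
  ungroup()

# pmax() is already element-wise, so it needs no rowwise().
by_pmax <- scores %>%
  mutate(max = pmax(english, japanese))

by_row$max   # 75.6 64.4 86.7
```

Calling `ungroup()` afterwards drops the row-wise grouping, so later verbs behave as usual.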
You should also know how to use nest_by().
Let’s take a look at an example with the above data.
nest_by() is a shortcut that groups the data by the given columns and nests each group into a list-column, returning a row-wise data frame. Here you get five rows, one per nationality, each holding a tibble with that nationality's students.
df %>%
  nest_by(nationality)
## # A tibble: 5 x 2
## # Rowwise: nationality
## nationality data
## <chr> <list<tbl_df[,6]>>
## 1 china [12 × 6]
## 2 indonesia [3 × 6]
## 3 japanese [135 × 6]
## 4 nepal [26 × 6]
## 5 vietnam [24 × 6]
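Because the result of `nest_by()` is row-wise, a later `mutate()` sees the `data` column as one group's tibble at a time. Here is a minimal sketch of that behaviour on a hypothetical three-row tibble named `students`; the column names `n` and `mean_english` are chosen only for illustration.

```r
library(dplyr)

# Hypothetical sample with two nationalities.
students <- tibble(
  nationality = c("china", "china", "nepal"),
  english     = c(60, 70, 64.4),
  japanese    = c(80, 75, 53.3)
)

nested <- students %>%
  nest_by(nationality)

# Each element of the list-column is an ordinary tibble.
first_group <- nested$data[[1]]

# Inside mutate(), `data` is the tibble of the current row only.
counts <- nested %>%
  mutate(
    n            = nrow(data),
    mean_english = mean(data$english)
  )
```

`counts` has one row per nationality, with the group size and the mean English mark next to the nested data.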
You can get tidy model outputs with the broom package.
library(broom)
broom::glance(
  lm(Volume ~ Girth + Height, trees)
)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.948 0.944 3.88 255. 1.07e-18 2 -84.5 177. 183.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Combining rowwise() and mutate() with broom::glance() gives you the fit statistics for several models at once. Here we use trees, a sample data set built into R.
# create a column with model formulas to test
models <- data.frame(
  formula = c(
    "Volume ~ Girth",
    "Volume ~ Girth + Height"
  )
)
# run them all and get fit statistics
models %>%
  dplyr::rowwise() %>%
  dplyr::mutate(
    broom::glance(lm(formula, trees))
  )
## # A tibble: 2 x 13
## # Rowwise:
## formula r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Volume… 0.935 0.933 4.25 419. 8.64e-19 1 -87.8 182.
## 2 Volume… 0.948 0.944 3.88 255. 1.07e-18 2 -84.5 177.
## # … with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## # nobs <int>
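Once the fit statistics sit in ordinary columns, the usual dplyr verbs can be used to compare the models. The sketch below keeps a few columns and sorts by AIC; `fit_stats` is a name introduced here, and the explicit `as.formula()` and `stringsAsFactors = FALSE` are defensive additions rather than something the original code requires.

```r
library(dplyr)
library(broom)

# Same two candidate models as above, stored as plain strings.
models <- data.frame(
  formula = c("Volume ~ Girth", "Volume ~ Girth + Height"),
  stringsAsFactors = FALSE
)

fit_stats <- models %>%
  rowwise() %>%
  # The unnamed one-row tibble from glance() is unpacked into columns.
  mutate(glance(lm(as.formula(formula), data = trees))) %>%
  ungroup() %>%
  select(formula, r.squared, adj.r.squared, AIC) %>%
  arrange(AIC)  # lowest AIC first
```

With the trees data, the two-predictor model has the lower AIC and therefore comes first.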
Back to the English and Japanese exam data: with nest_by() and mutate() you can create one chart per nationality.
library(ggplot2)
scatters <- df %>%
  dplyr::nest_by(nationality) %>%
  dplyr::mutate(
    charts = list(
      ggplot(data, aes(x = english, y = japanese)) +
        geom_point()
    )
  )
print(scatters)
## # A tibble: 5 x 3
## # Rowwise: nationality
## nationality data charts
## <chr> <list<tbl_df[,6]>> <list>
## 1 china [12 × 6] <gg>
## 2 indonesia [3 × 6] <gg>
## 3 japanese [135 × 6] <gg>
## 4 nepal [26 × 6] <gg>
## 5 vietnam [24 × 6] <gg>
Display the chart for the Chinese students.
scatters$charts[[1]]
Display the chart for the Indonesian students.
scatters$charts[[2]]
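Picking charts by position makes it easy to lose track of which nationality a plot belongs to. One possible remedy, sketched here on a hypothetical `students` tibble rather than the exam file, is to use the grouping column as the plot title; `titled` is a name made up for this example.

```r
library(dplyr)
library(ggplot2)

# Hypothetical sample with two nationalities.
students <- tibble(
  nationality = c("china", "china", "nepal", "nepal"),
  english     = c(60, 70, 64.4, 86.7),
  japanese    = c(80, 75, 53.3, 31.1)
)

titled <- students %>%
  nest_by(nationality) %>%
  mutate(
    charts = list(
      ggplot(data, aes(x = english, y = japanese)) +
        geom_point() +
        ggtitle(nationality)  # the grouping value of the current row
    )
  )

# Look a chart up by name instead of by position.
china_chart <- titled$charts[[which(titled$nationality == "china")]]
```

Printing `china_chart` draws the scatter plot with "china" as its title.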
Write the data for each nationality to a separate file.
df %>%
  dplyr::nest_by(nationality) %>%
  dplyr::mutate(
    write.csv(data, paste0("nationality", nationality, ".csv"))
  )
## # A tibble: 5 x 2
## # Rowwise: nationality
## nationality data
## <chr> <list<tbl_df[,6]>>
## 1 china [12 × 6]
## 2 indonesia [3 × 6]
## 3 japanese [135 × 6]
## 4 nepal [26 × 6]
## 5 vietnam [24 × 6]
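write.csv() returns nothing, which is why the result above shows no new column. If you want a record of what was written, a variation like the following may help. It is only a sketch: the `students` tibble, the temporary output directory, the underscore in the file name, and the `path` and `saved` columns are all assumptions made for this example.

```r
library(dplyr)

# Hypothetical sample with two nationalities.
students <- tibble(
  nationality = c("china", "china", "nepal"),
  english     = c(60, 70, 64.4),
  japanese    = c(80, 75, 53.3)
)

# Write into a temporary directory so the working directory stays clean.
out_dir <- tempdir()

written <- students %>%
  nest_by(nationality) %>%
  mutate(
    path  = file.path(out_dir, paste0("nationality_", nationality, ".csv")),
    saved = {
      write.csv(data, path, row.names = FALSE)  # side effect: one file per row
      file.exists(path)                         # value stored in `saved`
    }
  )

# Read one file back to confirm its contents.
china_back <- read.csv(written$path[1])
```

Each row of `written` now records where its file went and whether it exists.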
To be continued.