Vignette: Getting to know dplyr & stringr

Jered Ataky

2020-10-25

1. Introduction

“dplyr” is one of the tidyverse packages. It is a grammar of data manipulation: it provides a consistent set of verbs that help you solve the most common data manipulation problems.

“stringr”, on the other hand, is also a tidyverse package, and it focuses on string manipulation. It provides a set of functions designed to make working with strings easier.

Note that the tidyverse is a collection of R packages that share the same underlying design philosophy, grammar, and data structures. For more info about dplyr and stringr, read here.

Here, we are going to explore five major verbs for dplyr: filter(), select(), arrange(), mutate(), summarize()

and eight functions for stringr: str_subset(), str_detect(), str_count(), str_locate(), str_extract(), str_split(), str_replace(), str_match()

2. dplyr

Throughout this part of the vignette, we use “student performance”, a dataset from Kaggle containing a sample of 1,000 observations of 8 variables.

# library
library(dplyr)

# load in the dataset

data <- read.csv("https://raw.githubusercontent.com/jnataky/DATA-607/master/A2_Various_dataset_transformation/students_performance.csv")

# take a look at its structure
glimpse(data)
#> Rows: 1,000
#> Columns: 8
#> $ gender                      <chr> "female", "female", "female", "male", "...
#> $ race.ethnicity              <chr> "group B", "group C", "group B", "group...
#> $ parental.level.of.education <chr> "bachelor's degree", "some college", "m...
#> $ lunch                       <chr> "standard", "standard", "standard", "fr...
#> $ test.preparation.course     <chr> "none", "completed", "none", "none", "n...
#> $ math.score                  <int> 72, 69, 90, 47, 76, 71, 88, 40, 64, 38,...
#> $ reading.score               <int> 72, 90, 95, 57, 78, 83, 95, 43, 64, 60,...
#> $ writing.score               <int> 74, 88, 93, 44, 75, 78, 92, 39, 67, 50,...

filter(): picking cases by their values

The filter() function picks cases by their values. Let's say you want to pick the students whose math score is 100; you might write:


data %>%
  filter(math.score == 100)
#>   gender race.ethnicity parental.level.of.education        lunch
#> 1   male        group E          associate's degree free/reduced
#> 2 female        group E                some college     standard
#> 3 female        group E           bachelor's degree     standard
#> 4   male        group A                some college     standard
#> 5   male        group D                some college     standard
#> 6   male        group E           bachelor's degree     standard
#> 7 female        group E          associate's degree     standard
#>   test.preparation.course math.score reading.score writing.score
#> 1               completed        100           100            93
#> 2                    none        100            92            97
#> 3                    none        100           100           100
#> 4               completed        100            96            86
#> 5               completed        100            97            99
#> 6               completed        100           100           100
#> 7                    none        100           100           100

Similarly, if you want to pick the students whose math, writing, and reading scores are all 100, you might write:


data %>%
  filter(math.score == 100, writing.score == 100, reading.score == 100)
#>   gender race.ethnicity parental.level.of.education    lunch
#> 1 female        group E           bachelor's degree standard
#> 2   male        group E           bachelor's degree standard
#> 3 female        group E          associate's degree standard
#>   test.preparation.course math.score reading.score writing.score
#> 1                    none        100           100           100
#> 2               completed        100           100           100
#> 3                    none        100           100           100
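
Conditions separated by commas are combined with AND. As a minimal sketch of the difference between AND and OR, assuming a small made-up data frame called scores (a stand-in for the Kaggle data, not part of it), you might write:

```r
library(dplyr)

# Hypothetical mini data frame mimicking three columns of the dataset
scores <- data.frame(
  gender        = c("female", "male", "female", "male"),
  math.score    = c(100, 88, 95, 100),
  reading.score = c(100, 92, 100, 70)
)

# OR: a perfect score in math or in reading (rows 1, 3, and 4)
perfect_any <- scores %>%
  filter(math.score == 100 | reading.score == 100)

# AND: comma-separated conditions must all hold (row 1 only)
perfect_both <- scores %>%
  filter(math.score == 100, reading.score == 100)
```

The | operator keeps a row when at least one condition is true, so perfect_any has three rows while perfect_both has one.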

select(): selecting variables based on their names

This second verb lets you select variables based on their names. Let's say you want to select only the variables gender, writing.score, and reading.score; you might write:


data2 <- data %>%
  select(gender, writing.score, reading.score)

# Print the first six students
head(data2, 6)
#>   gender writing.score reading.score
#> 1 female            74            72
#> 2 female            88            90
#> 3 female            93            95
#> 4   male            44            57
#> 5   male            75            78
#> 6 female            78            83
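
select() also accepts helper functions, so you do not have to type every name. The sketch below assumes a made-up two-row data frame called scores with the same column naming style as the dataset; ends_with() and the minus sign are standard dplyr selection tools:

```r
library(dplyr)

# Hypothetical mini data frame; the values are for illustration only
scores <- data.frame(
  gender        = c("female", "male"),
  lunch         = c("standard", "free/reduced"),
  math.score    = c(72, 47),
  reading.score = c(72, 57),
  writing.score = c(74, 44)
)

# Keep gender plus every column whose name ends in ".score"
score_cols <- scores %>%
  select(gender, ends_with(".score"))

# Drop a single column by prefixing its name with a minus sign
no_lunch <- scores %>%
  select(-lunch)
```

Both results contain the columns gender, math.score, reading.score, and writing.score.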

arrange(): reordering the cases

The arrange() function lets you reorder the cases. Let's say you want to reorder the head of the previously selected data frame (data2) in descending order of the students' writing score; you might write:


# name the head of data2 as data3

data3 <- head(data2, 6)

# Reorder in descending order of writing score

data3 %>%
  arrange(desc(writing.score))
#>   gender writing.score reading.score
#> 1 female            93            95
#> 2 female            88            90
#> 3 female            78            83
#> 4   male            75            78
#> 5 female            74            72
#> 6   male            44            57
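
arrange() can sort by several variables at once; later variables break ties in earlier ones. A minimal sketch, assuming a made-up data frame called scores:

```r
library(dplyr)

# Hypothetical mini data frame; the values are for illustration only
scores <- data.frame(
  gender        = c("male", "female", "female", "male"),
  writing.score = c(75, 88, 93, 44)
)

# Sort by gender (ascending), then by writing score (descending) within gender
sorted <- scores %>%
  arrange(gender, desc(writing.score))

sorted$writing.score
#> [1] 93 88 75 44
```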

mutate(): creating new variables that are functions of existing variables

The mutate() function creates new variables that are functions of the existing variables. Let's say you want to create “english.score”, which is the average of writing.score and reading.score; you might write:


data3 %>%
  mutate(english.score = (writing.score + reading.score) / 2)
#>   gender writing.score reading.score english.score
#> 1 female            74            72          73.0
#> 2 female            88            90          89.0
#> 3 female            93            95          94.0
#> 4   male            44            57          50.5
#> 5   male            75            78          76.5
#> 6 female            78            83          80.5

If you are interested in keeping only the new variable, let's say you want to keep english.score and drop the two scores it was computed from, you might use another function called transmute():


data4 <- data3 %>%
  transmute(gender, english.score = (writing.score + reading.score) / 2)

data4
#>   gender english.score
#> 1 female          73.0
#> 2 female          89.0
#> 3 female          94.0
#> 4   male          50.5
#> 5   male          76.5
#> 6 female          80.5
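
mutate() can create several variables in one call, and a later variable may refer to one created earlier in the same call. In the sketch below, the scores data frame and the pass mark of 60 are assumptions made for illustration only:

```r
library(dplyr)

# Hypothetical mini data frame; the values are for illustration only
scores <- data.frame(
  writing.score = c(74, 44),
  reading.score = c(72, 57)
)

# 'passed' is computed from 'english.score', defined just above it
result <- scores %>%
  mutate(
    english.score = (writing.score + reading.score) / 2,
    passed        = english.score >= 60  # 60 is an arbitrary threshold
  )

result$english.score
#> [1] 73.0 50.5
```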

summarize(): summarizing multiple values to a single value

Here we introduce the function group_by(), which groups the data by the variable you want to summarize over. Let's say you are interested in the average math score for each gender; you might write:


data %>%
  group_by(gender) %>%
  summarize(math_score = sum(math.score) / n())
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#>   gender math_score
#>   <chr>       <dbl>
#> 1 female       63.6
#> 2 male         68.7
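
summarize() can compute several statistics per group in a single call, and mean() is a shorter way to write sum(x) / n(). A minimal sketch, assuming a made-up data frame called scores; the .groups argument is the one mentioned in the message above:

```r
library(dplyr)

# Hypothetical mini data frame; the values are for illustration only
scores <- data.frame(
  gender     = c("female", "female", "male", "male"),
  math.score = c(60, 70, 80, 90)
)

# One row per gender, with three summary columns
by_gender <- scores %>%
  group_by(gender) %>%
  summarize(
    n_students = n(),
    mean_math  = mean(math.score),
    max_math   = max(math.score),
    .groups    = "drop"  # return an ungrouped result and silence the message
  )
```

Here the female group has a mean of 65 and the male group a mean of 85.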

3. stringr

The stringr functions take a vector of strings as their first argument and a pattern as their second. For this entire part of the vignette, we will use one vector of strings, v, and the same pattern, p, a regular expression that matches any single lowercase vowel.


# library

library(stringr)

# Define the vector of strings

v <- c("sonority", "meal", "try", "cocktail", "cinema", "maximum", "mass")

# Pattern matching any vowel

p <- "[aeiou]"

str_subset(): extracting the matching components

The str_subset() function extracts the strings that contain a match. Here, it keeps the strings that contain at least one vowel:


str_subset(v, p)
#> [1] "sonority" "meal"     "cocktail" "cinema"   "maximum"  "mass"

str_detect(): telling if there is any pattern matching

The str_detect() function tells you whether each string contains a match for the pattern. If you want to detect whether each component of v contains a vowel, you might write:


str_detect(v, p)
#> [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
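
Because str_detect() returns a logical vector, you can count the matches with sum() or negate it to find the strings without a match. A short sketch using the same v and p as above (the names has_vowel, n_with_vowel, and no_vowel are introduced here for illustration):

```r
library(stringr)

v <- c("sonority", "meal", "try", "cocktail", "cinema", "maximum", "mass")
p <- "[aeiou]"

has_vowel <- str_detect(v, p)

# TRUE counts as 1, so sum() gives the number of strings with a vowel
n_with_vowel <- sum(has_vowel)
n_with_vowel
#> [1] 6

# Negate the logical vector to keep the strings without any vowel
no_vowel <- v[!has_vowel]
no_vowel
#> [1] "try"
```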

str_count(): counting the patterns

If you want to count the number of vowels in each component of the vector of strings, you might write:


str_count(v, p)
#> [1] 3 2 0 3 3 3 1

str_locate(): locating the position of the match

The str_locate() function gives the position of the first match. Let's say you want to know the position of the first vowel in each component of v; you might write:


str_locate(v, p)
#>      start end
#> [1,]     2   2
#> [2,]     2   2
#> [3,]    NA  NA
#> [4,]     2   2
#> [5,]     2   2
#> [6,]     2   2
#> [7,]     2   2
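
str_locate() reports only the first match. If you need every position, stringr also provides str_locate_all(), which returns a list with one matrix per string. A minimal sketch on a shorter vector (words is a name introduced here for illustration):

```r
library(stringr)

words <- c("meal", "try")

# One matrix per input string, one row per match
all_pos <- str_locate_all(words, "[aeiou]")

# "meal" has vowels at positions 2 and 3
all_pos[[1]][, "start"]
#> [1] 2 3

# "try" has no vowel, so its matrix has zero rows
nrow(all_pos[[2]])
#> [1] 0
```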

str_extract(): extracting the text of the match

The str_extract() function extracts the first match in each string. The following extracts the first vowel in each component of v:


str_extract(v, p)
#> [1] "o" "e" NA  "o" "i" "a" "a"
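
To get every match rather than only the first, you might use str_extract_all(), which returns a list with one character vector per string. A minimal sketch on a shorter vector (words is a name introduced here for illustration):

```r
library(stringr)

words <- c("meal", "try")

# One character vector per input string
all_vowels <- str_extract_all(words, "[aeiou]")

all_vowels[[1]]
#> [1] "e" "a"

# No vowel in "try", so the result is an empty character vector
length(all_vowels[[2]])
#> [1] 0
```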

str_split(): splitting up strings

Here we split strings separated by commas into pieces. Note that the space after each comma stays attached to the piece that follows it:


s <- c("dada, mum", "uncle, auntie, cousin", "men, women")
str_split(s, ",")
#> [[1]]
#> [1] "dada" " mum"
#> 
#> [[2]]
#> [1] "uncle"   " auntie" " cousin"
#> 
#> [[3]]
#> [1] "men"    " women"
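
Since the pattern is a regular expression, one way to avoid the leading spaces is to make the separator match the comma together with any whitespace after it. A minimal sketch using the same s as above:

```r
library(stringr)

s <- c("dada, mum", "uncle, auntie, cousin", "men, women")

# ",\\s*" matches a comma followed by zero or more whitespace characters
pieces <- str_split(s, ",\\s*")

pieces[[2]]
#> [1] "uncle"  "auntie" "cousin"
```

Another option is to keep the "," pattern and clean each piece afterwards with str_trim().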

str_replace(): replacing the matches with new text

The str_replace() function lets you replace the first match with the replacement that you specify. If you want to replace the first vowel in each component of v with “/”, you might write:


str_replace(v, p, "/")
#> [1] "s/nority" "m/al"     "try"      "c/cktail" "c/nema"   "m/ximum"  "m/ss"
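
To replace every match instead of only the first, you might use str_replace_all(). A short sketch with the same v and p as above (replaced_all is a name introduced here for illustration):

```r
library(stringr)

v <- c("sonority", "meal", "try", "cocktail", "cinema", "maximum", "mass")
p <- "[aeiou]"

# Every vowel is replaced, not just the first one in each string
replaced_all <- str_replace_all(v, p, "/")
replaced_all
#> [1] "s/n/r/ty" "m//l"     "try"      "c/ckt//l" "c/n/m/"   "m/x/m/m"  "m/ss"
```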

str_match(): extracting parts of the match defined by parentheses

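If you want to extract the letter before the first vowel in each component of v, you might write the following. The first column holds the whole match, and the second column holds the part captured by the parentheses: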


str_match(v, "(.)[aeiou]")
#>      [,1] [,2]
#> [1,] "so" "s" 
#> [2,] "me" "m" 
#> [3,] NA   NA  
#> [4,] "co" "c" 
#> [5,] "ci" "c" 
#> [6,] "ma" "m" 
#> [7,] "ma" "m"

For more about tidyverse packages:

R for Data Science