“dplyr” is one of the tidyverse packages and is used for data manipulation. It provides a grammar of data manipulation: a consistent set of verbs that solve the most common data manipulation problems.
“stringr”, on the other hand, is a tidyverse package focused on string manipulation. It provides a cohesive set of functions designed to make working with strings easier.
Note that the tidyverse is a collection of R packages that share the same design philosophy, grammar, and data structures. For more info about dplyr and stringr, see each package's documentation on the tidyverse website.
Here, we are going to explore five major verbs of dplyr: filter(), select(), arrange(), mutate(), summarize()
and eight functions of stringr: str_subset(), str_detect(), str_count(), str_locate(), str_extract(), str_split(), str_replace(), str_match()
Throughout this part of the vignette, we use “student performance”, a dataset from Kaggle containing a sample of 1,000 observations of 8 variables.
# library
library(dplyr)
# load in the dataset
data <- read.csv("https://raw.githubusercontent.com/jnataky/DATA-607/master/A2_Various_dataset_transformation/students_performance.csv")
# take a look at its structure
glimpse(data)
#> Rows: 1,000
#> Columns: 8
#> $ gender <chr> "female", "female", "female", "male", "...
#> $ race.ethnicity <chr> "group B", "group C", "group B", "group...
#> $ parental.level.of.education <chr> "bachelor's degree", "some college", "m...
#> $ lunch <chr> "standard", "standard", "standard", "fr...
#> $ test.preparation.course <chr> "none", "completed", "none", "none", "n...
#> $ math.score <int> 72, 69, 90, 47, 76, 71, 88, 40, 64, 38,...
#> $ reading.score <int> 72, 90, 95, 57, 78, 83, 95, 43, 64, 60,...
#> $ writing.score <int> 74, 88, 93, 44, 75, 78, 92, 39, 67, 50,...
The filter function picks cases by their values. Say you want to pick the students whose math score is 100; you might write:
data %>%
  filter(math.score == 100)
#> gender race.ethnicity parental.level.of.education lunch
#> 1 male group E associate's degree free/reduced
#> 2 female group E some college standard
#> 3 female group E bachelor's degree standard
#> 4 male group A some college standard
#> 5 male group D some college standard
#> 6 male group E bachelor's degree standard
#> 7 female group E associate's degree standard
#> test.preparation.course math.score reading.score writing.score
#> 1 completed 100 100 93
#> 2 none 100 92 97
#> 3 none 100 100 100
#> 4 completed 100 96 86
#> 5 completed 100 97 99
#> 6 completed 100 100 100
#> 7 none 100 100 100
Similarly, if you want to pick the students whose math, writing, and reading scores are all 100, you might write:
data %>%
  filter(math.score == 100, writing.score == 100, reading.score == 100)
#> gender race.ethnicity parental.level.of.education lunch
#> 1 female group E bachelor's degree standard
#> 2 male group E bachelor's degree standard
#> 3 female group E associate's degree standard
#> test.preparation.course math.score reading.score writing.score
#> 1 none 100 100 100
#> 2 completed 100 100 100
#> 3 none 100 100 100
The second verb, select, lets you pick variables based on their names. Say you want to keep only the variables gender, math.score, and writing.score; you might write:
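As a minimal sketch of that call, the example below runs select() on a small, hypothetical data frame named toy that stands in for the student performance data, so it works without downloading anything. The column names mirror the dataset; the values and the names toy and toy_selected are our own assumptions.

```r
library(dplyr)

# Hypothetical toy data standing in for the student performance dataset
toy <- data.frame(
  gender        = c("female", "male", "female"),
  math.score    = c(72L, 47L, 90L),
  reading.score = c(72L, 57L, 95L),
  writing.score = c(74L, 44L, 93L)
)

# Keep only the three named columns, in the order they are listed
toy_selected <- toy %>%
  select(gender, math.score, writing.score)

print(toy_selected)
```

On the full dataset, the same pattern would be data %>% select(gender, math.score, writing.score).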
The arrange function reorders the cases in the order that you want. Say you want to reorder the head of the previously selected data frame (data2) in descending order of math score; you might write:
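The sketch below illustrates the idea on a hypothetical data frame toy (our own stand-in, not the vignette's data2): head() keeps the first rows and arrange() with desc() sorts them from the highest to the lowest math score.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  gender        = c("female", "male", "female", "male"),
  math.score    = c(72L, 47L, 90L, 76L),
  writing.score = c(74L, 44L, 93L, 75L)
)

# Take the first three rows, then sort them by math score, highest first
toy_sorted <- toy %>%
  head(3) %>%
  arrange(desc(math.score))

print(toy_sorted)
```

Dropping desc() would sort in ascending order, which is arrange()'s default.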
The mutate function creates new variables that are functions of existing variables. Say you want to create “english.score”, the average of writing.score and reading.score; you might write:
data3 %>%
  mutate(english.score = (writing.score + reading.score) / 2)
#> gender writing.score reading.score english.score
#> 1 female 74 72 73.0
#> 2 female 88 90 89.0
#> 3 female 93 95 94.0
#> 4 male 44 57 50.5
#> 5 male 75 78 76.5
#> 6 female 78 83 80.5
If you are interested in keeping only the new variable, say english.score without the two existing scores, you might use another function called transmute():
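A minimal sketch of transmute(), assuming a hypothetical data frame toy with the same column names as the example above: the result contains only the newly computed column.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  gender        = c("female", "male"),
  writing.score = c(74L, 44L),
  reading.score = c(72L, 57L)
)

# transmute() computes the new column and drops every other column
english_only <- toy %>%
  transmute(english.score = (writing.score + reading.score) / 2)

print(english_only)
```

As far as we know, recent dplyr releases mark transmute() as superseded in favor of mutate(..., .keep = "none"), which should give the same result here.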
Here we introduce the function group_by, which groups the data by a variable so that summarize computes one summary per group. Say you are interested in the average math score for each gender; you might write:
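The following sketch shows the group_by() and summarize() combination on a hypothetical data frame toy; the summary column name avg.math is our own choice, not one taken from the dataset.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  gender     = c("female", "male", "female", "male"),
  math.score = c(72L, 47L, 90L, 76L)
)

# One row per gender, holding the mean math score of that group
math_by_gender <- toy %>%
  group_by(gender) %>%
  summarize(avg.math = mean(math.score))

print(math_by_gender)
```

On the full dataset, the same pipeline would start from data instead of toy.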
The stringr functions take a vector of strings as their first argument and a pattern as their second. For this entire part of the vignette, we will use one vector of strings, v, and the same pattern, p, a regular expression matching any single lowercase vowel.
# library
library(stringr)
# Define the vector of strings
v <- c("sonority", "meal", "try", "cocktail", "cinema", "maximum", "mass")
# Pattern matching any vowel
p <- "[aeiou]"
The str_subset function extracts the strings that contain a match; here, the strings that contain a vowel.
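A minimal sketch of that call, redefining v and p so the snippet runs on its own; the result name with_vowels is our own.

```r
library(stringr)

v <- c("sonority", "meal", "try", "cocktail", "cinema", "maximum", "mass")
p <- "[aeiou]"

# Keep only the strings that contain at least one match for the pattern
with_vowels <- str_subset(v, p)
print(with_vowels)
```

Only "try" is dropped, because "y" is not part of the character class in p.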
The str_detect function tells you whether each string contains a match for the pattern. If you want to check for a vowel in each component of v, you might write:
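A short sketch of the call, with v and p redefined so it runs on its own; the name has_vowel is our own.

```r
library(stringr)

v <- c("sonority", "meal", "try", "cocktail", "cinema", "maximum", "mass")
p <- "[aeiou]"

# One logical value per string: TRUE when the pattern matches somewhere
has_vowel <- str_detect(v, p)
print(has_vowel)
# TRUE TRUE FALSE TRUE TRUE TRUE TRUE
```

Because the result is a logical vector, it can be used directly for indexing, for example v[has_vowel].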
If you want to count the number of vowels in each component of v with str_count, you might write:
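A short sketch of the call, with v and p redefined so it runs on its own; the name n_vowels is our own.

```r
library(stringr)

v <- c("sonority", "meal", "try", "cocktail", "cinema", "maximum", "mass")
p <- "[aeiou]"

# Number of pattern matches in each string
n_vowels <- str_count(v, p)
print(n_vowels)
# 3 2 0 3 3 3 1
```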
The str_locate function tells you where the match is. Say you want to know the position of the first vowel in each component of v; you might write:
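A short sketch of the call, with v and p redefined so it runs on its own; the name locs is our own. str_locate() reports only the first match in each string, as a matrix with start and end columns.

```r
library(stringr)

v <- c("sonority", "meal", "try", "cocktail", "cinema", "maximum", "mass")
p <- "[aeiou]"

# Matrix with one row per string; NA where the pattern does not match
locs <- str_locate(v, p)
print(locs)
```

To get the positions of every vowel rather than only the first, str_locate_all() returns a list with one such matrix per string.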
The str_extract function extracts the first match in each string. The following extracts the first vowel in each component of v:
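A short sketch of the call, with v and p redefined so it runs on its own; the name first_vowel is our own.

```r
library(stringr)

v <- c("sonority", "meal", "try", "cocktail", "cinema", "maximum", "mass")
p <- "[aeiou]"

# First match in each string; NA when there is no match
first_vowel <- str_extract(v, p)
print(first_vowel)
# "o" "e" NA "o" "i" "a" "a"
```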
Here we use str_split to split strings separated by commas into pieces:
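A minimal sketch of str_split(), using a hypothetical vector s of comma-separated strings (our own example data, not taken from the dataset). The result is a list with one character vector per input string.

```r
library(stringr)

# Hypothetical comma-separated strings, made up for illustration
s <- c("apple,banana", "red,green,blue")

# Split each string at every comma
pieces <- str_split(s, ",")
print(pieces)
```

Here pieces[[1]] holds the two fruit names and pieces[[2]] holds the three color names.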
The str_replace function replaces the first match with the replacement you specify. If you want to replace the first vowel in each component of v with “/”, you might write:
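A short sketch of the call, with v and p redefined so it runs on its own; the name replaced is our own.

```r
library(stringr)

v <- c("sonority", "meal", "try", "cocktail", "cinema", "maximum", "mass")
p <- "[aeiou]"

# Replace only the first match in each string; strings without a match are unchanged
replaced <- str_replace(v, p, "/")
print(replaced)
# "s/nority" "m/al" "try" "c/cktail" "c/nema" "m/ximum" "m/ss"
```

To replace every vowel instead of only the first, str_replace_all() takes the same arguments.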
If you want to extract the letter before the first vowel in each component of v, you might use str_match with a capture group:
str_match(v, "(.)[aeiou]")
#> [,1] [,2]
#> [1,] "so" "s"
#> [2,] "me" "m"
#> [3,] NA NA
#> [4,] "co" "c"
#> [5,] "ci" "c"
#> [6,] "ma" "m"
#> [7,] "ma" "m"
For more about tidyverse packages: