Dealing with String data

Authors
Affiliations

P K Parida

CRFM and ICAR

June Masters

CRFM

Published

August 21, 2024

1 Escape function in string data

When we want a literal backslash (\) in a string, we have to type a double backslash (\\): the first backslash is the escape character and the second is the literal character that ends up in the text.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
x <- c("x_sp\\sp","y_sp\\sp", "z_sp\\sp" )
str_view(x)
[1] │ x_sp\sp
[2] │ y_sp\sp
[3] │ z_sp\sp

Suppose we want to use a double quote (") inside a string that is itself written between double quotes. We then have to escape it with a backslash, i.e. type \", to get a single literal " in the result.

y <- c("x_sp\"sp","y_sp\\sp", "z_sp\\sp" )
str_view(y)
[1] │ x_sp"sp
[2] │ y_sp\sp
[3] │ z_sp\sp
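Because the printed form of a string shows the escapes, it can help to check what a string really contains. The following is a minimal sketch, not part of the original example; it assumes R 4.0 or later for the raw-string syntax r"(...)", and the variable names are only illustrative.

```r
library(stringr)

# Two backslashes are typed, but only one is stored in the string.
escaped <- "x_sp\\sp"
n_chars <- str_length(escaped)   # counts the stored characters: x _ s p \ s p

# print() shows the escaped form; writeLines() shows the actual contents.
print(escaped)
writeLines(escaped)

# A raw string (assumed R >= 4.0) needs no escaping at all.
raw_version <- r"(x_sp\sp)"
same <- identical(escaped, raw_version)
same
```

Both spellings produce the same seven-character string, so the choice between them is only about readability.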

2 str_c() function to combine strings

We will discuss how to create a new character variable in a data set from existing character variables.

3 Running Code

library(gtExtras)
Loading required package: gt
data("iris")

Text data can be combined using the str_c() function from stringr, which is part of the tidyverse.

first_name <- "Pranaya"
last_name <- "Parida"
str_c(first_name, last_name,sep =" ")
[1] "Pranaya Parida"

This joins the two variables into a single string with a space between them. The call has three arguments: the first string (or variable), the second string to append to it, and an optional separator, sep, placed between them.

Creating a data frame called name:

first_names <- c("Pranaya", "Pradip")
last_names <- "Parida"
name <- as.data.frame(str_c(first_names, last_names,sep =" "))
name
  str_c(first_names, last_names, sep = " ")
1                            Pranaya Parida
2                             Pradip Parida

str_c() is very similar to the base paste0(), but is designed to be used with mutate() by obeying the usual tidyverse rules.
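The difference from paste0() is easiest to see with missing values and with inputs of unequal length. The sketch below is illustrative only and assumes stringr 1.5.0 or later, where str_c() follows the tidyverse recycling rules.

```r
library(stringr)

# Missing values: str_c() propagates NA, paste0() turns it into the text "NA".
with_str_c  <- str_c("Hi ", c("Flora", NA), "!")
with_paste0 <- paste0("Hi ", c("Flora", NA), "!")
with_str_c
with_paste0

# Recycling: length-1 inputs are recycled by both functions, but str_c()
# (assumed stringr >= 1.5.0) raises an error for incompatible lengths such as
# 2 and 3, where paste0() recycles silently.
recycle_result <- tryCatch(
  str_c(c("a", "b"), c("x", "y", "z")),
  error = function(e) "error"
)
recycle_result
```

Propagating NA is usually what you want inside mutate(), because a missing input then stays visibly missing in the new column.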

Let us work with the iris data set. We create a new variable called Detail with str_c() by joining the species name and the sepal length with the separator ": " and appending "cm" at the end. We display the result as a table using the gtExtras package.

iris %>% 
  select(Species, Sepal.Length) %>% 
  mutate(Detail = 
              str_c(Species, 
                    ": ",
                    Sepal.Length,
                    "cm" )) %>%
  slice(1:10) %>% 
  gt() %>% 
  tab_header(title = "Iris data with new details ") %>% 
  cols_align(align ="left") %>% 
  gt_theme_pff()
Iris data with new details
Species  Sepal.Length  Detail
setosa   5.1           setosa: 5.1cm
setosa   4.9           setosa: 4.9cm
setosa   4.7           setosa: 4.7cm
setosa   4.6           setosa: 4.6cm
setosa   5.0           setosa: 5cm
setosa   5.4           setosa: 5.4cm
setosa   4.6           setosa: 4.6cm
setosa   5.0           setosa: 5cm
setosa   4.4           setosa: 4.4cm
setosa   4.9           setosa: 4.9cm

The gtExtras package, together with gt, is used to format the table.

If you want missing values to display in another way, use coalesce() to replace them. Depending on what you want, you might use it either inside or outside of str_c():

df <- tibble(name = c("Flora", "David", "Terra", NA))
df |> mutate(greeting = str_c("Hi ", name, "!"))
# A tibble: 4 × 2
  name  greeting 
  <chr> <chr>    
1 Flora Hi Flora!
2 David Hi David!
3 Terra Hi Terra!
4 <NA>  <NA>     
df |> 
  mutate(
    greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
    greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
  )
# A tibble: 4 × 3
  name  greeting1 greeting2
  <chr> <chr>     <chr>    
1 Flora Hi Flora! Hi Flora!
2 David Hi David! Hi David!
3 Terra Hi Terra! Hi Terra!
4 <NA>  Hi you!   Hi!      

4 Summarizing the text data

Now we will discuss str_flatten(), which takes a character vector and combines its elements into a single string. This makes it a good fit for summarizing a table with summarize():

df1 <- tribble(
  ~ name, ~ fruit,
  "Carmen", "banana",
  "Carmen", "apple",
  "Marvin", "nectarine",
  "Terence", "cantaloupe",
  "Terence", "papaya",
  "Terence", "mandarin"
)

df1 |>
  group_by(name) |> 
  summarize(fruits = str_flatten(fruit, ", ")) %>%
  gt() %>% 
  gt_theme_pff()
name     fruits
Carmen   banana, apple
Marvin   nectarine
Terence  cantaloupe, papaya, mandarin

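Using str_c() alone does not work here. str_c() returns one string per input element rather than a single string per group, so summarize() issues a deprecation warning and returns one row per fruit: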

df1 |>
  group_by(name) |> 
  summarize(fruits = str_c(fruit))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'name'. You can override using the
`.groups` argument.
# A tibble: 6 × 2
# Groups:   name [3]
  name    fruits    
  <chr>   <chr>     
1 Carmen  banana    
2 Carmen  apple     
3 Marvin  nectarine 
4 Terence cantaloupe
5 Terence papaya    
6 Terence mandarin  
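The contrast between the two functions can also be checked on a plain vector. This is a small sketch with an illustrative vector; the last argument is assumed to be available, which is the case from stringr 1.5.0.

```r
library(stringr)

fruit <- c("cantaloupe", "papaya", "mandarin")

# str_flatten() always returns a single string.
flat <- str_flatten(fruit, ", ")
flat

# `last` sets a different separator before the final element.
flat_and <- str_flatten(fruit, ", ", last = ", and ")
flat_and

# str_c() without `collapse` returns one string per input element.
n_pieces <- length(str_c(fruit))
n_pieces
```

Because str_flatten() returns exactly one value, it satisfies the one-row-per-group rule that summarize() expects.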

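5 str_detect() function

str_detect() returns TRUE for every string that matches a pattern and FALSE otherwise. Here we use it to keep only the car models in mtcars whose name contains a capital M: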

mtcars %>% 
  mutate(model = rownames(mtcars)) %>% 
  mutate(has_M = str_detect(model, 'M')) %>% 
  filter(has_M == TRUE) %>% 
  select(model, mpg, cyl, disp) %>%
  gt() %>% 
  gt_theme_538()
model          mpg   cyl  disp
Mazda RX4      21.0  6    160.0
Mazda RX4 Wag  21.0  6    160.0
Merc 240D      24.4  4    146.7
Merc 230       22.8  4    140.8
Merc 280       19.2  6    167.6
Merc 280C      17.8  6    167.6
Merc 450SE     16.4  8    275.8
Merc 450SL     17.3  8    275.8
Merc 450SLC    15.2  8    275.8
AMC Javelin    15.2  8    304.0
Maserati Bora  15.0  8    301.0
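The table above contains only names with a capital M because str_detect() is case sensitive by default. A minimal sketch on a hand-made vector (the model names are chosen for illustration) shows the difference when case is ignored with regex():

```r
library(stringr)

model <- c("Mazda RX4", "Merc 230", "Fiat 128", "Camaro Z28")

# Case sensitive by default: only a capital "M" matches.
has_M <- str_detect(model, "M")
has_M

# regex(ignore_case = TRUE) also matches the lowercase "m" in "Camaro".
has_m_any <- str_detect(model, regex("m", ignore_case = TRUE))
has_m_any

# A logical vector can be summed to count the matches.
n_M <- sum(has_M)
n_M
```

Since str_detect() returns a logical vector, it can also be passed straight to filter() without creating a helper column first.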

5.1 str_glue()

If you are mixing many fixed and variable strings with str_c(), you’ll notice that you type a lot of "s, making it hard to see the overall goal of the code. An alternative approach is provided by the glue package via str_glue(). You give it a single string that has a special feature: anything inside {} will be evaluated like it’s outside of the quotes:

df <- tibble(name = c("Flora", "David", "Terra", NA))
df |> mutate(greeting = str_glue("Hi {name}!"))
# A tibble: 4 × 2
  name  greeting 
  <chr> <glue>   
1 Flora Hi Flora!
2 David Hi David!
3 Terra Hi Terra!
4 <NA>  Hi NA!   
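Two details of str_glue() are worth checking: the last row above shows that a missing value becomes the text "NA", unlike with str_c(), and literal braces have to be doubled. The following sketch uses an illustrative two-element vector:

```r
library(stringr)

name <- c("Flora", NA)

# str_glue() turns a missing value into the text "NA" ...
glued <- str_glue("Hi {name}!")
glued

# ... whereas str_c() propagates the missing value.
joined <- str_c("Hi ", name, "!")
joined

# Double the braces to get a literal { and } in the output.
braces <- str_glue("{{Hi {name[1]}!}}")
braces
```

If the "NA" text is not wanted, one option is to wrap the variable in coalesce() inside the braces, in the same way as shown for str_c() earlier.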

6 Extracting data from strings

It’s very common for multiple variables to be crammed together into a single string. In this section, you’ll learn how to use four tidyr functions to extract them:

  • df |> separate_longer_delim(col, delim)

  • df |> separate_longer_position(col, width)

  • df |> separate_wider_delim(col, delim, names)

  • df |> separate_wider_position(col, widths)

If you look closely, you can see there’s a common pattern here: separate_, then longer or wider, then _, then by delim or position.

  • Just like with pivot_longer() and pivot_wider(), _longer functions make the input data frame longer by creating new rows and _wider functions make the input data frame wider by generating new columns.

6.1 Separating in rows

Separating a string into rows tends to be most useful when the number of components varies from row to row. The most common case is requiring separate_longer_delim() to split based on a delimiter:

df1 <- tibble(x = c("a,b,c", "d,e", "f"))
df1 |> 
  separate_longer_delim(x, delim = ",")
# A tibble: 6 × 1
  x    
  <chr>
1 a    
2 b    
3 c    
4 d    
5 e    
6 f    
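In real data the string column usually sits next to other columns. As a sketch, with a hypothetical id column added for illustration, the other columns are repeated for every new row:

```r
library(tibble)
library(tidyr)

orders <- tibble(
  id = 1:3,                        # hypothetical identifier column
  x  = c("a,b,c", "d,e", "f")
)

# Each piece of x gets its own row; id is repeated to match.
longer <- orders |>
  separate_longer_delim(x, delim = ",")
longer
```

This keeps the link between each piece and the row it came from, which is needed for later grouping or joining.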

It’s rarer to see separate_longer_position() in the wild, but some older datasets do use a very compact format where each character is used to record a value:

df2 <- tibble(x = c("1211", "131", "21"))
df2 |> 
  separate_longer_position(x, width = 1)
# A tibble: 9 × 1
  x    
  <chr>
1 1    
2 2    
3 1    
4 1    
5 1    
6 3    
7 1    
8 2    
9 1    

6.2 Separating into columns

Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. The _wider functions are slightly more complicated than their longer equivalents because you need to name the columns. For example, in the following dataset, x is made up of a code, an edition number, and a year, separated by ".". To use separate_wider_delim(), we supply the delimiter and the names in two arguments:

df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
df3 |> 
  separate_wider_delim(
    x,
    delim = ".",
    names = c("code", "edition", "year")
  )
# A tibble: 3 × 3
  code  edition year 
  <chr> <chr>   <chr>
1 a10   1       2022 
2 b10   2       2011 
3 e15   1       2015 

If a specific piece is not useful you can use an NA name to omit it from the results:

df3 |> 
  separate_wider_delim(
    x,
    delim = ".",
    names = c("code", NA, "year")
  )
# A tibble: 3 × 2
  code  year 
  <chr> <chr>
1 a10   2022 
2 b10   2011 
3 e15   2015 

separate_wider_position() works a little differently because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column, and the value is the number of characters it occupies. You can omit values from the output by not naming them.

df4 <- tibble(x = c("202215TX", "202122LA", "202325CA")) 
df4 |> 
  separate_wider_position(
    x,
    widths = c(year = 4, age = 2, state = 2)
  )
# A tibble: 3 × 3
  year  age   state
  <chr> <chr> <chr>
1 2022  15    TX   
2 2021  22    LA   
3 2023  25    CA   
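The example above names all three widths, so nothing is dropped. A minimal sketch of the omission case, reusing the same data and leaving the middle width unnamed:

```r
library(tibble)
library(tidyr)

df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))

# The unnamed width (2) still consumes two characters but creates no column.
no_age <- df4 |>
  separate_wider_position(
    x,
    widths = c(year = 4, 2, state = 2)
  )
no_age
```

The widths still have to add up to the full string length; the unnamed entry only controls which pieces are kept.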

6.3 Diagnosing widening problems

separate_wider_delim() requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces? There are two possible problems, too few or too many pieces, so separate_wider_delim() provides two arguments to help: too_few and too_many. Let’s first look at the too_few case with the following sample dataset.

df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))

#df |> 
  #separate_wider_delim(
    #x,
    #delim = "-",
    #names = c("x", "y", "z") )
#> Error in `separate_wider_delim()`:
#> ! Expected 3 pieces in each element of `x`.
#> ! 2 values were too short.
#> ℹ Use `too_few = "debug"` to diagnose the problem.
#> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.

You’ll notice that we get an error, but the error gives us some suggestions on how you might proceed. Let’s start by debugging the problem:

df |> 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "debug"
  )
Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
`x_remainder`.
# A tibble: 5 × 6
  x     y     z     x_ok  x_pieces x_remainder
  <chr> <chr> <chr> <lgl>    <int> <chr>      
1 1-1-1 1     1     TRUE         3 ""         
2 1-1-2 1     2     TRUE         3 ""         
3 1-3   3     <NA>  FALSE        2 ""         
4 1-3-2 3     2     TRUE         3 ""         
5 1     <NA>  <NA>  FALSE        1 ""         

When you use the debug mode, you get three extra columns added to the output: x_ok, x_pieces, and x_remainder (if you separate a variable with a different name, you’ll get a different prefix).

Sometimes looking at this debugging information will reveal a problem with your delimiter strategy or suggest that you need to do more preprocessing before separating. In that case, fix the problem upstream and make sure to remove too_few = "debug" to ensure that new problems become errors.
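One way to use the debugging columns is to filter on x_ok to list only the inputs that failed. This is a sketch on the same sample data; the object names debug and problems are only illustrative.

```r
library(dplyr)
library(tibble)
library(tidyr)

df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))

# Debug mode emits a warning and adds x_ok, x_pieces and x_remainder.
debug <- df |>
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "debug"
  )

# Keep only the rows that did not split into the expected three pieces.
problems <- debug |> filter(!x_ok)
problems
```

With a large data set this is much quicker than scanning the full output by eye.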

In other cases, you may want to fill in the missing pieces with NAs and move on. That’s the job of too_few = "align_start" and too_few = "align_end" which allow you to control where the NAs should go.

df |> 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "align_start"
  )
# A tibble: 5 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     1     1    
2 1     1     2    
3 1     3     <NA> 
4 1     3     2    
5 1     <NA>  <NA> 
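For comparison, here is a sketch of the same call with too_few = "align_end", which fills the missing pieces with NA at the start instead of the end:

```r
library(tibble)
library(tidyr)

df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))

# Short rows are padded with NA on the left, so the last piece lands in z.
aligned_end <- df |>
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "align_end"
  )
aligned_end
```

Which alignment is appropriate depends on whether the leading or the trailing components are the ones that may be missing in your data.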

The same principles apply if you have too many pieces:

df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))
df |> 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = "debug"
  )
Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
`x_remainder`.
# A tibble: 5 × 6
  x         y     z     x_ok  x_pieces x_remainder
  <chr>     <chr> <chr> <lgl>    <int> <chr>      
1 1-1-1     1     1     TRUE         3 ""         
2 1-1-2     1     2     TRUE         3 ""         
3 1-3-5-6   3     5     FALSE        4 "-6"       
4 1-3-2     3     2     TRUE         3 ""         
5 1-3-5-7-9 3     5     FALSE        5 "-7-9"     

You have a slightly different set of options for handling too many pieces: you can either silently “drop” any additional pieces or “merge” them all into the final column:

df |> 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = "drop"
  )
# A tibble: 5 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     1     1    
2 1     1     2    
3 1     3     5    
4 1     3     2    
5 1     3     5    
df |> 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = "merge"
  )
# A tibble: 5 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     1     1    
2 1     1     2    
3 1     3     5-6  
4 1     3     2    
5 1     3     5-7-9

7 Length of character variables

str_length() tells you the number of characters in the string (spaces included):

str_length(c("a", "R for data science", NA))
[1]  1 18 NA

You could use this with count() to find the distribution of lengths of US babynames and then with filter() to look at the longest names, which happen to have 15 letters:

library(babynames)
Warning: package 'babynames' was built under R version 4.3.3
babynames |>
  count(length = str_length(name), wt = n)
# A tibble: 14 × 2
   length        n
    <int>    <int>
 1      2   338150
 2      3  8589596
 3      4 48506739
 4      5 87011607
 5      6 90749404
 6      7 72120767
 7      8 25404066
 8      9 11926551
 9     10  1306159
10     11  2135827
11     12    16295
12     13    10845
13     14     3681
14     15      830
babynames |> 
  filter(str_length(name) == 15) |> 
  count(name, wt = n, sort = TRUE)
# A tibble: 34 × 2
   name                n
   <chr>           <int>
 1 Franciscojavier   123
 2 Christopherjohn   118
 3 Johnchristopher   118
 4 Christopherjame   108
 5 Christophermich    52
 6 Ryanchristopher    45
 7 Mariadelosangel    28
 8 Jonathanmichael    25
 9 Christianjoseph    22
10 Christopherjose    22
# ℹ 24 more rows

7.1 Subsetting

You can extract parts of a string using str_sub(string, start, end), where start and end are the positions where the substring should start and end. The start and end arguments are inclusive, so the length of the returned string will be end - start + 1:

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
[1] "App" "Ban" "Pea"

You can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.

str_sub(x, -3, -1)
[1] "ple" "ana" "ear"

Note that str_sub() won’t fail if the string is too short: it will just return as much as possible:

str_sub("a", 1, 5)
[1] "a"
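The rules above can be checked together in one short sketch, using the same vector x as before; the object names are only illustrative.

```r
library(stringr)

x <- c("Apple", "Banana", "Pear")

# Inclusive positions: characters 2, 3 and 4, so 4 - 2 + 1 = 3 characters each.
middle <- str_sub(x, 2, 4)
middle
str_length(middle)

# Omitting `end` runs to the end of the string.
tail_part <- str_sub(x, 3)
tail_part

# Positions past the end are truncated rather than raising an error.
short <- str_sub("a", 1, 5)
short
```

The end - start + 1 rule therefore holds only when the string is long enough to cover the requested positions.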

We could use str_sub() with mutate() to find the first and last letter of each name:

babynames |> 
  mutate(
    first = str_sub(name, 1, 1),
    last = str_sub(name, -1, -1)
  )
# A tibble: 1,924,665 × 7
    year sex   name          n   prop first last 
   <dbl> <chr> <chr>     <int>  <dbl> <chr> <chr>
 1  1880 F     Mary       7065 0.0724 M     y    
 2  1880 F     Anna       2604 0.0267 A     a    
 3  1880 F     Emma       2003 0.0205 E     a    
 4  1880 F     Elizabeth  1939 0.0199 E     h    
 5  1880 F     Minnie     1746 0.0179 M     e    
 6  1880 F     Margaret   1578 0.0162 M     t    
 7  1880 F     Ida        1472 0.0151 I     a    
 8  1880 F     Alice      1414 0.0145 A     e    
 9  1880 F     Bertha     1320 0.0135 B     a    
10  1880 F     Sarah      1288 0.0132 S     h    
# ℹ 1,924,655 more rows