When we want a literal \ inside a character string, we have to write \\ (a double backslash): the first backslash is the escape character and the second is the character that ends up in the text.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
x <- c("x_sp\\sp", "y_sp\\sp", "z_sp\\sp")
str_view(x)
[1] │ x_sp\sp
[2] │ y_sp\sp
[3] │ z_sp\sp
Suppose we want a " inside a string that is itself written between double quotes. We then have to escape it with a backslash, i.e. write \", to get a literal " in the result.
y <- c("x_sp\"sp", "y_sp\\sp", "z_sp\\sp")
str_view(y)
[1] │ x_sp"sp
[2] │ y_sp\sp
[3] │ z_sp\sp
2 str_c() function to combine strings
We will discuss how to create a new character variable in a data set from existing variables.
3 Running Code
library(gtExtras)
Loading required package: gt
data("iris")
Text data can be combined using the str_c() function from the tidyverse (it lives in the stringr package).
It joins two or more vectors element by element into a single character vector. You pass the strings or variables to combine, in the order you want them to appear, and optionally a separator through the sep argument (an empty string by default).
str_c() is very similar to the base paste0(), but is designed to be used with mutate() by obeying the usual tidyverse rules.
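One place where the difference from paste0() shows up is missing values. A small sketch, using made-up vectors purely for illustration:

```r
library(stringr)

# paste0() turns a missing value into the literal text "NA"
base_result <- paste0("Hi ", c("Flora", NA), "!")

# str_c() propagates the missing value instead
stringr_result <- str_c("Hi ", c("Flora", NA), "!")

base_result
stringr_result
```

Propagating NA is usually what you want inside mutate(), because a missing input then stays visibly missing in the new column.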
Let us work with the iris data set. We create a new variable called Detail with str_c() by joining the species name and the sepal length, separated by ": ", and appending "cm" at the end. We then display the data as a table using the gt and gtExtras packages.
iris %>%
  select(Species, Sepal.Length) %>%
  mutate(Detail = str_c(Species, ": ", Sepal.Length, "cm")) %>%
  slice(1:10) %>%
  gt() %>%
  tab_header(title = "Iris data with new details ") %>%
  cols_align(align = "left") %>%
  gt_theme_pff()
Iris data with new details
Species  Sepal.Length  Detail
setosa   5.1           setosa: 5.1cm
setosa   4.9           setosa: 4.9cm
setosa   4.7           setosa: 4.7cm
setosa   4.6           setosa: 4.6cm
setosa   5.0           setosa: 5cm
setosa   5.4           setosa: 5.4cm
setosa   4.6           setosa: 4.6cm
setosa   5.0           setosa: 5cm
setosa   4.4           setosa: 4.4cm
setosa   4.9           setosa: 4.9cm
The gtExtras package is used here to style the table.
If you want missing values to display in another way, use coalesce() to replace them. Depending on what you want, you might use it either inside or outside of str_c().
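The code behind the output that follows is not shown. A minimal sketch that would give the same result, assuming a tibble named df with a single name column (the name df and its contents are reconstructed from the printed output):

```r
library(dplyr)
library(stringr)

# Assumed input: the names shown in the output, with one missing value
df <- tibble(name = c("Flora", "David", "Terra", NA))

greetings <- df |>
  mutate(
    # coalesce() inside str_c(): only the missing name is replaced
    greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
    # coalesce() outside str_c(): the whole missing greeting is replaced
    greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
  )

greetings
```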
# A tibble: 4 × 3
name greeting1 greeting2
<chr> <chr> <chr>
1 Flora Hi Flora! Hi Flora!
2 David Hi David! Hi David!
3 Terra Hi Terra! Hi Terra!
4 <NA> Hi you! Hi!
4 Summarizing the text data
Now we will discuss str_flatten(). We can use this function together with summarize() to collapse the strings in each group of a table into a single string.
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'name'. You can override using the
`.groups` argument.
# A tibble: 6 × 2
# Groups: name [3]
name fruits
<chr> <chr>
1 Carmen banana
2 Carmen apple
3 Marvin nectarine
4 Terence cantaloupe
5 Terence papaya
6 Terence mandarin
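The output above still has one row per fruit, which suggests that the expression inside summarise() returned several values per group. A minimal sketch of collapsing each group into a single string with str_flatten(), assuming input data that matches the rows printed above (the names fruit_df and fruit are assumptions):

```r
library(dplyr)
library(stringr)

# Assumed input, rebuilt from the rows printed above
fruit_df <- tribble(
  ~name,     ~fruit,
  "Carmen",  "banana",
  "Carmen",  "apple",
  "Marvin",  "nectarine",
  "Terence", "cantaloupe",
  "Terence", "papaya",
  "Terence", "mandarin"
)

flattened <- fruit_df |>
  group_by(name) |>
  # str_flatten() always returns one string, so each group yields one row
  summarize(fruits = str_flatten(fruit, ", "))

flattened
```

Because str_flatten() returns exactly one value per group, this form should not trigger the reframe() deprecation warning.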
If you are mixing many fixed and variable strings with str_c(), you’ll notice that you type a lot of "s, making it hard to see the overall goal of the code. An alternative approach is provided by the glue package via str_glue(). You give it a single string that has a special feature: anything inside {} will be evaluated like it’s outside of the quotes:
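The call that produced the output that follows is not shown. A minimal sketch consistent with it, assuming a tibble named df with a name column (both reconstructed from the printed output):

```r
library(dplyr)
library(stringr)

# Assumed input: the names shown in the output, with one missing value
df <- tibble(name = c("Flora", "David", "Terra", NA))

glued <- df |>
  # {name} is replaced by the value of the name column in each row
  mutate(greeting = str_glue("Hi {name}!"))

glued
```

Note that, unlike str_c(), str_glue() turns a missing value into the text "NA".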
# A tibble: 4 × 2
name greeting
<chr> <glue>
1 Flora Hi Flora!
2 David Hi David!
3 Terra Hi Terra!
4 <NA> Hi NA!
6 Extracting data from strings
It’s very common for multiple variables to be crammed together into a single string. In this section, you’ll learn how to use four tidyr functions to extract them:
df |> separate_longer_delim(col, delim)
df |> separate_longer_position(col, width)
df |> separate_wider_delim(col, delim, names)
df |> separate_wider_position(col, widths)
If you look closely, you can see there’s a common pattern here: separate_, then longer or wider, then _, then by delim or position.
Just like with pivot_longer() and pivot_wider(), _longer functions make the input data frame longer by creating new rows and _wider functions make the input data frame wider by generating new columns.
6.1 Separating in rows
Separating a string into rows tends to be most useful when the number of components varies from row to row. The most common case is requiring separate_longer_delim() to split based on a delimiter:
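A minimal sketch of splitting on a delimiter, using a small made-up tibble (df1 and its values are assumptions for illustration):

```r
library(dplyr)
library(tidyr)

# Each string holds a varying number of comma-separated components
df1 <- tibble(x = c("a,b,c", "d,e", "f"))

longer_delim <- df1 |>
  separate_longer_delim(x, delim = ",")

longer_delim
```

Each component gets its own row, so the three input rows become six.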
It’s rarer to see separate_longer_position() in the wild, but some older datasets do use a very compact format where each character is used to record a value:
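A minimal sketch of splitting by position, again with a made-up tibble (df2 and its values are assumptions for illustration):

```r
library(dplyr)
library(tidyr)

# Each character in the string records one value
df2 <- tibble(x = c("1211", "131", "21"))

longer_position <- df2 |>
  separate_longer_position(x, width = 1)

longer_position
```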
6.2 Separating in columns
Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. The _wider functions are slightly more complicated than their _longer equivalents because you need to name the columns. For example, in the following dataset, x is made up of a code, an edition number, and a year, separated by ".". To use separate_wider_delim(), we supply the delimiter and the names in two arguments:
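The dataset and call are not shown. A minimal sketch that would give the output that follows, assuming a tibble named df3 (the code and year values are taken from the printed output; the edition numbers are made up). Using NA in names drops that piece from the result:

```r
library(dplyr)
library(tidyr)

# Assumed input: code.edition.year (edition values are hypothetical)
df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))

wider_delim <- df3 |>
  separate_wider_delim(
    x,
    delim = ".",
    names = c("code", NA, "year")  # NA omits the edition column
  )

wider_delim
```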
# A tibble: 3 × 2
code year
<chr> <chr>
1 a10 2022
2 b10 2011
3 e15 2015
separate_wider_position() works a little differently because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column, and the value is the number of characters it occupies. You can omit values from the output by not naming them.
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
df4 |>
  separate_wider_position(
    x,
    widths = c(year = 4, age = 2, state = 2)
  )
# A tibble: 3 × 3
year age state
<chr> <chr> <chr>
1 2022 15 TX
2 2021 22 LA
3 2023 25 CA
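To illustrate omitting a value by leaving its width unnamed, here is a small sketch on the same data that drops the age column:

```r
library(dplyr)
library(tidyr)

df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))

without_age <- df4 |>
  separate_wider_position(
    x,
    # The unnamed width still consumes 2 characters but creates no column
    widths = c(year = 4, 2, state = 2)
  )

without_age
```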
6.3 Diagnosing widening problems
separate_wider_delim() requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces? There are two possible problems, too few or too many pieces, so separate_wider_delim() provides two arguments to help: too_few and too_many. Let’s first look at the too_few case with the following sample dataset.
df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))

# df |>
#   separate_wider_delim(
#     x,
#     delim = "-",
#     names = c("x", "y", "z")
#   )
#> Error in `separate_wider_delim()`:
#> ! Expected 3 pieces in each element of `x`.
#> ! 2 values were too short.
#> ℹ Use `too_few = "debug"` to diagnose the problem.
#> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
You’ll notice that we get an error, but the error gives us some suggestions on how you might proceed. Let’s start by debugging the problem:
When you use the debug mode, you get three extra columns added to the output: x_ok, x_pieces, and x_remainder (if you separate a variable with a different name, you’ll get a different prefix).
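A minimal sketch of the debug call on the sample data defined earlier in this section (tidyr emits a warning to remind you that debug mode is active):

```r
library(dplyr)
library(tidyr)

df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))

debug_out <- df |>
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "debug"  # adds x_ok, x_pieces and x_remainder
  )

debug_out

# Keep only the rows that did not split into the expected 3 pieces
problem_rows <- debug_out |> filter(!x_ok)
problem_rows
```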
Sometimes looking at this debugging information will reveal a problem with your delimiter strategy or suggest that you need to do more preprocessing before separating. In that case, fix the problem upstream and make sure to remove too_few = "debug" to ensure that new problems become errors.
In other cases, you may want to fill in the missing pieces with NAs and move on. That’s the job of too_few = "align_start" and too_few = "align_end" which allow you to control where the NAs should go.
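A short sketch of both alignment options on the same sample data, showing where the NAs end up:

```r
library(dplyr)
library(tidyr)

df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))

# Pieces fill columns from the left; NAs are added on the right
aligned_start <- df |>
  separate_wider_delim(
    x, delim = "-", names = c("x", "y", "z"),
    too_few = "align_start"
  )

# Pieces fill columns from the right; NAs are added on the left
aligned_end <- df |>
  separate_wider_delim(
    x, delim = "-", names = c("x", "y", "z"),
    too_few = "align_end"
  )

aligned_start
aligned_end
```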
You have a slightly different set of options for handling too many pieces: you can either silently “drop” any additional pieces or “merge” them all into the final column:
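The input for this case is not shown. A minimal sketch of both options, assuming a tibble named df_many whose values are reconstructed from the output that follows (which corresponds to the "merge" option):

```r
library(dplyr)
library(tidyr)

# Assumed input: two rows have more than 3 pieces
df_many <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))

# "drop" silently discards the extra pieces
dropped <- df_many |>
  separate_wider_delim(
    x, delim = "-", names = c("x", "y", "z"),
    too_many = "drop"
  )

# "merge" keeps the extra pieces in the final column
merged <- df_many |>
  separate_wider_delim(
    x, delim = "-", names = c("x", "y", "z"),
    too_many = "merge"
  )

dropped
merged
```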
# A tibble: 5 × 3
x y z
<chr> <chr> <chr>
1 1 1 1
2 1 1 2
3 1 3 5-6
4 1 3 2
5 1 3 5-7-9
7 Length of character variables
str_length() tells you the number of characters in the string (spaces count too):
str_length(c("a", "R for data science", NA))
[1] 1 18 NA
You could use this with count() to find the distribution of lengths of US babynames and then with filter() to look at the longest names, which happen to have 15 letters:
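A sketch of that pattern on a small stand-in tibble, so it runs without the babynames package (the tibble names_df and its counts are made up; the same calls should apply to babynames, which also has name and n columns):

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for babynames: one row per name with a count n
names_df <- tibble(
  name = c("Ida", "Mary", "Anna", "Elizabeth"),
  n    = c(10, 20, 5, 7)
)

# Distribution of name lengths, weighted by the number of babies
length_dist <- names_df |>
  count(length = str_length(name), wt = n)

# The longest names in the data
longest <- names_df |>
  filter(str_length(name) == max(str_length(name)))

length_dist
longest
```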
library(babynames)
Warning: package 'babynames' was built under R version 4.3.3
You can extract parts of a string using str_sub(string, start, end), where start and end are the positions where the substring should start and end. The start and end arguments are inclusive, so the length of the returned string will be end - start + 1.
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
[1] "App" "Ban" "Pea"
You can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.
str_sub(x, -3, -1)
[1] "ple" "ana" "ear"
Note that str_sub() won’t fail if the string is too short: it will just return as much as possible:
str_sub("a", 1, 5)
[1] "a"
We could use str_sub() with mutate() to find the first and last letter of each name:
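The call is not shown. A minimal sketch of the idea on a small stand-in tibble built from the first names in the output that follows (names_sample is a made-up name; the same mutate() should apply to babynames):

```r
library(dplyr)
library(stringr)

# Stand-in data: the first few names from the printed output
names_sample <- tibble(name = c("Mary", "Anna", "Emma", "Elizabeth"))

first_last <- names_sample |>
  mutate(
    first = str_sub(name, 1, 1),   # first character
    last  = str_sub(name, -1, -1)  # last character, counted from the end
  )

first_last
```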
# A tibble: 1,924,665 × 7
year sex name n prop first last
<dbl> <chr> <chr> <int> <dbl> <chr> <chr>
1 1880 F Mary 7065 0.0724 M y
2 1880 F Anna 2604 0.0267 A a
3 1880 F Emma 2003 0.0205 E a
4 1880 F Elizabeth 1939 0.0199 E h
5 1880 F Minnie 1746 0.0179 M e
6 1880 F Margaret 1578 0.0162 M t
7 1880 F Ida 1472 0.0151 I a
8 1880 F Alice 1414 0.0145 A e
9 1880 F Bertha 1320 0.0135 B a
10 1880 F Sarah 1288 0.0132 S h
# ℹ 1,924,655 more rows