stringr overview

Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparation tasks. The stringr package provides a cohesive set of functions designed to make working with strings as easy as possible. If you’re not familiar with strings, the best place to start is the chapter on strings in R for Data Science.

stringr is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations. stringr focusses on the most important and commonly used string manipulation functions whereas stringi provides a comprehensive set covering almost anything you can imagine. If you find that stringr is missing a function that you need, try looking in stringi. Both packages share similar conventions, so once you’ve mastered stringr, you should find stringi similarly easy to use.

Usage

All functions in stringr start with str_ and take a vector of strings as the first argument.

library(stringr, warn.conflicts = FALSE, quietly = TRUE)  # or attach the tidyverse instead
x <- c("why", "video", "cross", "extra", "deal", "authority")
str_length(x) 
## [1] 3 5 5 5 4 9
str_c(x, collapse = ", ")
## [1] "why, video, cross, extra, deal, authority"
str_sub(x, 1, 2)
## [1] "wh" "vi" "cr" "ex" "de" "au"
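
Two more conventions are worth knowing at this point. The short sketch below reuses the same vector x; the names last_two and labels are only illustrative.

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# Negative positions in str_sub() count from the end of each string
last_two <- str_sub(x, -2, -1)
last_two
## [1] "hy" "eo" "ss" "ra" "al" "ty"

# str_c() is vectorised: the length-1 prefix is recycled over 1:3
labels <- str_c("word_", 1:3)
labels
## [1] "word_1" "word_2" "word_3"
```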

Most string functions work with regular expressions, a concise language for describing patterns of text. For example, the regular expression “[aeiou]” matches any single character that is a vowel:

str_subset(x, "[aeiou]")
## [1] "video"     "cross"     "extra"     "deal"      "authority"
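
The same pattern can be passed to other pattern verbs. As a small sketch on the same vector, str_detect() reports whether each string matches and str_count() counts the matches; has_vowel and n_vowels are illustrative names.

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# TRUE where the string contains at least one vowel
has_vowel <- str_detect(x, "[aeiou]")
has_vowel
## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

# Number of vowels in each string
n_vowels <- str_count(x, "[aeiou]")
n_vowels
## [1] 0 3 1 2 2 4
```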

Main useful functions of the stringr package

Several commonly used functions need no pattern at all; they measure, change the case of, trim, pad, or truncate strings:

x <- c("Tural", "Said", "Ali", "Sami", "Nagi", "Seymur")
str_length(x)
## [1] 5 4 3 4 4 6
str_to_upper(x)
## [1] "TURAL"  "SAID"   "ALI"    "SAMI"   "NAGI"   "SEYMUR"
str_to_lower(x)
## [1] "tural"  "said"   "ali"    "sami"   "nagi"   "seymur"
str_to_sentence(x)
## [1] "Tural"  "Said"   "Ali"    "Sami"   "Nagi"   "Seymur"
str_trim("    Trim this text  ")   # removes leading and trailing whitespace
## [1] "Trim this text"
str_pad("Test", width = 10, side = "both")   # pads both sides with spaces up to a total width of 10
## [1] "   Test   "
str_trunc("This is some long sentence ....", width = 10)   # truncates to 10 characters, including the "..."
## [1] "This is..."
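
A few variations on trimming and padding come up often when cleaning scraped text. The inputs below are made up for illustration and are not taken from the dataset.

```r
library(stringr)

# str_squish() trims both ends and collapses repeated inner whitespace
squished <- str_squish("  Toyota   Land    Cruiser ")
squished
## [1] "Toyota Land Cruiser"

# str_pad() can pad with any single character, for example leading zeros
padded <- str_pad("7", width = 3, side = "left", pad = "0")
padded
## [1] "007"

# str_trim() can be restricted to one side
left_trimmed <- str_trim("  keep right  ", side = "left")
left_trimmed
## [1] "keep right  "
```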

Now let’s do some cleaning and manipulation with a dataset fetched from turbo.az, a car sales website. We need this dataset later to fit regression models that relate price to year (or to engine size or mileage). First, let’s build the dataset with the following code block.

library(rvest)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
link <- "https://turbo.az/autos"
page <- read_html(link)

car_name <- page %>% html_nodes("div.products-i__name.products-i__bottom-text") %>% html_text()
car_info <- page %>% html_nodes("div.products-i__attributes.products-i__bottom-text") %>% html_text()
car_price <- page %>% html_nodes("div.product-price") %>% html_text()
ad_time <- page %>% html_nodes("div.products-i__datetime") %>% html_text()

car_data <- data.frame(car_name, car_info, car_price, ad_time, stringsAsFactors = FALSE)
#row.names(car_data)<-NULL

head(car_data)
##            car_name                car_info  car_price                ad_time
## 1        Honda M-NV       2022, 0.0 L, 0 km 61 100 AZN      Baki, dünən 15:19
## 2   DongFeng EQ 220       2022, 0.0 L, 0 km 29 500 AZN Baki, 15.10.2022 14:19
## 3 Chery Tiggo 2 Pro       2022, 1.5 L, 0 km 30 900 AZN Baki, 18.10.2022 11:48
## 4   Volkswagen ID.4       2022, 0.0 L, 0 km   44 500 $      Baki, dünən 15:24
## 5   Hyundai Elantra  2019, 2.0 L, 43 200 km 34 500 AZN      Baki, dünən 12:29
## 6       Ford Fiesta 2016, 1.6 L, 162 544 km 14 300 AZN      Baki, bugün 10:02

Join and Split strings

Now let’s start making our dataset more useful. Run these code blocks in RStudio and inspect the results step by step.

# Let's split the car_info column into three parts: year, engine and mileage
head(str_split(car_data$car_info, ","))
## [[1]]
## [1] "2022"   " 0.0 L" " 0 km" 
## 
## [[2]]
## [1] "2022"   " 0.0 L" " 0 km" 
## 
## [[3]]
## [1] "2022"   " 1.5 L" " 0 km" 
## 
## [[4]]
## [1] "2022"   " 0.0 L" " 0 km" 
## 
## [[5]]
## [1] "2019"       " 2.0 L"     " 43 200 km"
## 
## [[6]]
## [1] "2016"        " 1.6 L"      " 162 544 km"
# The reverse of str_split() is str_c():  str_c(c("Turn ", "me ", "into ", "one sentence"), collapse = "")

# Replace every occurrence of a pattern in a string (here, remove "AZN")
str_replace_all(car_data$car_price, "AZN", "")
##  [1] "61 100 "  "29 500 "  "30 900 "  "44 500 $" "34 500 "  "14 300 " 
##  [7] "33 800 "  "31 500 $" "15 000 "  "12 000 "  "2 200 "   "26 500 " 
## [13] "24 700 $" "43 900 $" "44 900 $" "16 300 "  "35 900 "  "8 200 "  
## [19] "42 900 $" "27 800 "  "40 500 $" "6 400 "   "6 000 "   "52 900 $"
## [25] "45 000 "  "4 500 "   "49 500 $" "38 900 "  "10 900 "  "9 000 "  
## [31] "11 000 "  "29 900 "  "9 200 "   "45 900 $" "17 800 "  "40 800 $"
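
The split shown above returns a list, which is awkward to put back into a data frame. One possible way to get three typed columns is sketched below. The sample values are copied from the car_info output above; cars, parts and the column names are illustrative. The sketch assumes the separators are ordinary spaces, and live scraped text may contain other whitespace characters.

```r
library(stringr)

# Sample values copied from the car_info output above
car_info <- c("2022, 0.0 L, 0 km", "2019, 2.0 L, 43 200 km", "2016, 1.6 L, 162 544 km")

# str_split_fixed() returns a character matrix with exactly n columns
parts <- str_split_fixed(car_info, ",", n = 3)

cars <- data.frame(
  year    = as.integer(str_trim(parts[, 1])),
  engine  = as.numeric(str_trim(str_remove(parts[, 2], "L"))),  # litres
  mileage = as.numeric(str_remove_all(parts[, 3], "[^0-9]"))    # km, digits only
)
cars
##   year engine mileage
## 1 2022    0.0       0
## 2 2019    2.0   43200
## 3 2016    1.6  162544
```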

Sorting strings

# str_order() returns the indices that would sort the vector; use them to reorder any dataset
car_data$car_name[str_order(car_data$car_name)]
##  [1] "Audi A6"                 "Chery Tiggo 2 Pro"      
##  [3] "Chevrolet Cruze"         "DongFeng EQ 220"        
##  [5] "Ford Fiesta"             "Honda M-NV"             
##  [7] "Hyundai Elantra"         "Hyundai Santa Fe"       
##  [9] "Hyundai Sonata"          "Hyundai Venue"          
## [11] "Kia Sorento"             "Kia Sportage"           
## [13] "LADA (VAZ) 2102"         "LADA (VAZ) 2104"        
## [15] "LADA (VAZ) 2107"         "LADA (VAZ) 2107"        
## [17] "LADA (VAZ) 21099"        "Lexus GX 460"           
## [19] "Mercedes CLS 350"        "Mercedes E 200"         
## [21] "Mercedes E 240"          "Mercedes E 320"         
## [23] "Mercedes G 500"          "Mercedes Vito 111"      
## [25] "Mini Cooper"             "Mitsubishi Pajero"      
## [27] "Mitsubishi Pajero Sport" "Mitsubishi Pajero Sport"
## [29] "Opel Zafira"             "Opel Zafira"            
## [31] "Peugeot 407"             "Toyota Avalon"          
## [33] "Toyota Land Cruiser"     "Toyota Land Cruiser"    
## [35] "Toyota Prado"            "Volkswagen ID.4"
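
Names that contain numbers sort character by character by default, which is not always what we want. The sketch below uses three made-up model names to show the numeric argument of str_sort() and str_order(), and how str_order() can reorder a whole data frame; cars and row_id are illustrative names.

```r
library(stringr)

# Made-up model names, chosen so that the two orderings differ
models <- c("Mercedes E 320", "Mercedes E 55", "Mercedes E 200")

# Default: digits are compared one character at a time
plain_sorted <- str_sort(models)
plain_sorted
## [1] "Mercedes E 200" "Mercedes E 320" "Mercedes E 55"

# numeric = TRUE compares runs of digits as numbers
numeric_sorted <- str_sort(models, numeric = TRUE)
numeric_sorted
## [1] "Mercedes E 55"  "Mercedes E 200" "Mercedes E 320"

# str_order() gives row indices, so it can reorder a whole data frame
cars <- data.frame(row_id = 1:3, car_name = models)
cars_sorted <- cars[str_order(cars$car_name, numeric = TRUE), ]
```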

String matching

price <- str_detect(car_data$car_price, pattern = "\\$")  # TRUE where the string contains a dollar sign
price
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [13]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE
## [25] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
price_indices <- str_which(car_data$car_price, pattern = "\\$")   # now let's get the indices of the matched strings
price_indices
##  [1]  4  8 13 14 15 19 21 24 27 34 36
car_data$car_price[price_indices] # extract the matched strings using those indices
##  [1] "44 500 $" "31 500 $" "24 700 $" "43 900 $" "44 900 $" "42 900 $"
##  [7] "40 500 $" "52 900 $" "49 500 $" "45 900 $" "40 800 $"
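
Because str_detect() returns a logical vector, it combines well with sum() and mean(). The sketch below uses four prices copied from the outputs above. It wraps the pattern in fixed(), which matches the dollar sign literally without escaping, and uses negate = TRUE to keep the non-matching strings; in_usd and azn_only are illustrative names.

```r
library(stringr)

# Sample values copied from the car_price outputs above
car_price <- c("61 100 AZN", "44 500 $", "14 300 AZN", "31 500 $")

# fixed() treats "$" as a literal character, not as a regex anchor
in_usd <- str_detect(car_price, fixed("$"))

sum(in_usd)    # number of dollar prices
## [1] 2
mean(in_usd)   # share of dollar prices
## [1] 0.5

# negate = TRUE returns the strings that do NOT match
azn_only <- str_subset(car_price, fixed("$"), negate = TRUE)
azn_only
## [1] "61 100 AZN" "14 300 AZN"
```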

Substring and Replace

Now we are going to extract subsets and substrings from our dataset. Suppose we want only the cars priced above 15000. To do that, we first have to convert the price column from character to numeric.

str_subset(car_data$car_name,"Kia")
## [1] "Kia Sorento"  "Kia Sportage"
# Some tricks with the str_extract() function:
str <- c("G1:E001", "G2:E002", "G3:E003")
print(str)
## [1] "G1:E001" "G2:E002" "G3:E003"
str_extract(string = str, pattern = "E[0-9]+")
## [1] "E001" "E002" "E003"
str_remove(string = str, pattern = "^.*:")   # removes everything up to and including the colon
## [1] "E001" "E002" "E003"
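
When both halves of such a code are needed, str_match() with capture groups is one option. This sketch reuses the same three strings under the name codes (chosen to avoid masking base R's str() function); the first column of the result holds the full match and each further column holds one group.

```r
library(stringr)

codes <- c("G1:E001", "G2:E002", "G3:E003")

# One capture group per part; the result is a character matrix
m <- str_match(codes, "(G[0-9]+):(E[0-9]+)")

group_id <- m[, 2]
group_id
## [1] "G1" "G2" "G3"

item_id <- m[, 3]
item_id
## [1] "E001" "E002" "E003"
```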

Let’s divide the car_price column into price and currency

gsub("[^0-9.-]", "", car_data$car_price)
##  [1] "61100" "29500" "30900" "44500" "34500" "14300" "33800" "31500" "15000"
## [10] "12000" "2200"  "26500" "24700" "43900" "44900" "16300" "35900" "8200" 
## [19] "42900" "27800" "40500" "6400"  "6000"  "52900" "45000" "4500"  "49500"
## [28] "38900" "10900" "9000"  "11000" "29900" "9200"  "45900" "17800" "40800"
gsub("[[:digit:]]", "", car_data$car_price)
##  [1] "  AZN" "  AZN" "  AZN" "  $"   "  AZN" "  AZN" "  AZN" "  $"   "  AZN"
## [10] "  AZN" "  AZN" "  AZN" "  $"   "  $"   "  $"   "  AZN" "  AZN" "  AZN"
## [19] "  $"   "  AZN" "  $"   "  AZN" "  AZN" "  $"   "  AZN" "  AZN" "  $"  
## [28] "  AZN" "  AZN" "  AZN" "  AZN" "  AZN" "  AZN" "  $"   "  AZN" "  $"
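
The two gsub() calls above are base R. A possible stringr version is sketched below on four sample prices copied from the outputs above; it also converts the amount to numeric so that it can be filtered. The names price and currency are illustrative, and the filter compares raw amounts without converting between currencies.

```r
library(stringr)

# Sample values copied from the car_price outputs above
car_price <- c("61 100 AZN", "44 500 $", "14 300 AZN", "2 200 AZN")

# Keep only the digits, then convert to numeric
price <- as.numeric(str_remove_all(car_price, "[^0-9]"))
price
## [1] 61100 44500 14300  2200

# Pull out the currency marker
currency <- str_extract(car_price, "AZN|\\$")
currency
## [1] "AZN" "$"   "AZN" "AZN"

# Rows with an amount above 15000 (exchange rates are ignored here)
car_price[price > 15000]
## [1] "61 100 AZN" "44 500 $"
```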

Replace function

Let’s look at two very useful functions for replacing characters in a string: sub() from base R and str_replace() from stringr.

head(car_data$car_price)
## [1] "61 100 AZN" "29 500 AZN" "30 900 AZN" "44 500 $"   "34 500 AZN"
## [6] "14 300 AZN"
car_data %>% select(car_price) %>%
  mutate(test = sub(pattern = "AZN", replacement = "Manat", x = car_price)) %>% head()
##    car_price         test
## 1 61 100 AZN 61 100 Manat
## 2 29 500 AZN 29 500 Manat
## 3 30 900 AZN 30 900 Manat
## 4   44 500 $     44 500 $
## 5 34 500 AZN 34 500 Manat
## 6 14 300 AZN 14 300 Manat
car_data %>% select(car_price) %>%
  mutate(test = str_replace(pattern = "\\$", replacement = "Dollar", string = car_price)) %>% head()
##    car_price          test
## 1 61 100 AZN    61 100 AZN
## 2 29 500 AZN    29 500 AZN
## 3 30 900 AZN    30 900 AZN
## 4   44 500 $ 44 500 Dollar
## 5 34 500 AZN    34 500 AZN
## 6 14 300 AZN    14 300 AZN
# Classwork: explain the difference between str_replace() and str_replace_all()
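
The two replacements above can also be done in one call. As a sketch, str_replace_all() accepts a named character vector in which each name is a pattern and each value is its replacement; the two sample prices are copied from the outputs above and relabelled is an illustrative name.

```r
library(stringr)

car_price <- c("61 100 AZN", "44 500 $")

# Names are regular expressions, values are the replacements
relabelled <- str_replace_all(car_price, c("AZN" = "Manat", "\\$" = "Dollar"))
relabelled
## [1] "61 100 Manat"  "44 500 Dollar"
```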

Viewing Strings

# Returns the first 6 rows of the car_name column
car_data$car_name[1:6]  
## [1] "Honda M-NV"        "DongFeng EQ 220"   "Chery Tiggo 2 Pro"
## [4] "Volkswagen ID.4"   "Hyundai Elantra"   "Ford Fiesta"
# Shows the strings as HTML text with the matched pattern highlighted
#str_view(car_data$car_name,pattern="Kia")

Classwork:

  1. Practice with strings https://r4ds.had.co.nz/strings.html
  2. A more powerful package (stringi): https://r4ds.had.co.nz/strings.html#stringi
  3. Clean data from bina.az.csv (a dataset made by Mr. Cavad)
  4. Write some regular expression statements: Link here