stringr overview
Strings are not glamorous, high-profile components of R, but they do
play a big role in many data cleaning and preparation tasks. The stringr
package provides a cohesive set of functions designed to make working
with strings as easy as possible. If you’re not familiar with strings,
the best place to start is the chapter on strings in R for Data
Science.
stringr is built on top of stringi, which uses the ICU C library to
provide fast, correct implementations of common string manipulations.
stringr focusses on the most important and commonly used string
manipulation functions whereas stringi provides a comprehensive set
covering almost anything you can imagine. If you find that stringr is
missing a function that you need, try looking in stringi. Both packages
share similar conventions, so once you’ve mastered stringr, you should
find stringi similarly easy to use.
Usage
All functions in stringr start with str_ and take a vector of strings as
the first argument.
library(stringr, warn.conflicts = FALSE, quietly = TRUE) # or attach the tidyverse instead
x <- c("why", "video", "cross", "extra", "deal", "authority")
str_length(x)
## [1] 3 5 5 5 4 9
str_c(x, collapse = ", ")
## [1] "why, video, cross, extra, deal, authority"
str_sub(x, 1, 2)
## [1] "wh" "vi" "cr" "ex" "de" "au"
Most string functions work with regular expressions, a concise
language for describing patterns of text. For example, the regular
expression “[aeiou]” matches any single character that is a vowel:
str_subset(x, "[aeiou]")
## [1] "video" "cross" "extra" "deal" "authority"
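The same vowel pattern works with other stringr verbs. The following is a minimal sketch, reusing the vector x from above, that detects, counts, and extracts matches; the variable names are chosen here for illustration only.

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# TRUE where the word contains at least one vowel
has_vowel <- str_detect(x, "[aeiou]")

# Number of vowels in each word
n_vowels <- str_count(x, "[aeiou]")

# First vowel in each word (NA when the word has none)
first_vowel <- str_extract(x, "[aeiou]")

print(data.frame(x, has_vowel, n_vowels, first_vowel))
```

Note that "y" is not in the character class, so "why" counts as having no vowels.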
Main useful functions of the stringr package
Several basic helpers handle length, case conversion, trimming, padding and truncation:
x <- c("Tural", "Said", "Ali", "Sami", "Nagi", "Seymur")
str_length(x)
## [1] 5 4 3 4 4 6
str_to_upper(x)
## [1] "TURAL" "SAID" "ALI" "SAMI" "NAGI" "SEYMUR"
str_to_lower(x)
## [1] "tural" "said" "ali" "sami" "nagi" "seymur"
str_to_sentence(x)
## [1] "Tural" "Said" "Ali" "Sami" "Nagi" "Seymur"
str_trim(" Trim this text ") #trims the white spaces
## [1] "Trim this text"
str_pad("Test", width=10, side="both") #pads both sides with whitespace
## [1] "   Test   "
str_trunc("This is some long sentence ....", width=10) #truncates the string to the given width
## [1] "This is..."
Now let's try some cleaning and manipulation tricks on a dataset that we
scraped from turbo.az, a car sales website. We need this dataset to plot
regression models that draw out insights between price and year (or
engine, mileage). First, let's build the dataset with the following code
block.
library(rvest)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
link <- "https://turbo.az/autos"
page <- read_html(link)
car_name <- page %>% html_nodes("div.products-i__name.products-i__bottom-text") %>% html_text()
car_info <- page %>% html_nodes("div.products-i__attributes.products-i__bottom-text") %>% html_text()
car_price <- page %>% html_nodes("div.product-price") %>% html_text()
ad_time <- page %>% html_nodes("div.products-i__datetime") %>% html_text()
car_data <- data.frame(car_name, car_info, car_price, ad_time, stringsAsFactors = FALSE)
#row.names(car_data)<-NULL
head(car_data)
##            car_name                car_info  car_price                ad_time
## 1        Honda M-NV       2022, 0.0 L, 0 km 61 100 AZN      Baki, dünən 15:19
## 2   DongFeng EQ 220       2022, 0.0 L, 0 km 29 500 AZN Baki, 15.10.2022 14:19
## 3 Chery Tiggo 2 Pro       2022, 1.5 L, 0 km 30 900 AZN Baki, 18.10.2022 11:48
## 4   Volkswagen ID.4       2022, 0.0 L, 0 km   44 500 $      Baki, dünən 15:24
## 5   Hyundai Elantra  2019, 2.0 L, 43 200 km 34 500 AZN      Baki, dünən 12:29
## 6       Ford Fiesta 2016, 1.6 L, 162 544 km 14 300 AZN      Baki, bugün 10:02
Join and Split strings
Now let's start making our dataset more useful. Run these code blocks
in RStudio and inspect the results step by step.
# Let's divide the car_info column into three parts: year, engine and mileage
head(str_split(car_data$car_info, ","))
## [[1]]
## [1] "2022" " 0.0 L" " 0 km"
##
## [[2]]
## [1] "2022" " 0.0 L" " 0 km"
##
## [[3]]
## [1] "2022" " 1.5 L" " 0 km"
##
## [[4]]
## [1] "2022" " 0.0 L" " 0 km"
##
## [[5]]
## [1] "2019" " 2.0 L" " 43 200 km"
##
## [[6]]
## [1] "2016" " 1.6 L" " 162 544 km"
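str_split() returns a list, so one more step is needed to get actual columns. The following is a minimal sketch, assuming a few car_info values copied from the output above; the names parts and info are chosen here for illustration. It uses simplify = TRUE to get a character matrix and then converts each piece to a number.

```r
library(stringr)

# Sample values copied from the head() output above
car_info <- c("2022, 0.0 L, 0 km", "2019, 2.0 L, 43 200 km", "2016, 1.6 L, 162 544 km")

# simplify = TRUE returns a character matrix with one column per piece
parts <- str_split(car_info, ",", simplify = TRUE)

info <- data.frame(
  year    = as.integer(str_trim(parts[, 1])),
  # keep only the numeric part of " 1.6 L"
  engine  = as.numeric(str_extract(parts[, 2], "[0-9.]+")),
  # drop everything that is not a digit, including the thousands separator
  mileage = as.numeric(str_remove_all(parts[, 3], "[^0-9]")),
  stringsAsFactors = FALSE
)

print(info)
```

Removing every non-digit from the mileage also removes the space used as a thousands separator, so "162 544 km" becomes 162544.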
# The reverse of splitting is joining with str_c(): str_c(c("Turn ", "me ", "into ", "one sentence"), collapse = "")
# Replace every occurrence of a pattern in a string
str_replace_all(car_data$car_price,"AZN","")
## [1] "61 100 " "29 500 " "30 900 " "44 500 $" "34 500 " "14 300 "
## [7] "33 800 " "31 500 $" "15 000 " "12 000 " "2 200 " "26 500 "
## [13] "24 700 $" "43 900 $" "44 900 $" "16 300 " "35 900 " "8 200 "
## [19] "42 900 $" "27 800 " "40 500 $" "6 400 " "6 000 " "52 900 $"
## [25] "45 000 " "4 500 " "49 500 $" "38 900 " "10 900 " "9 000 "
## [31] "11 000 " "29 900 " "9 200 " "45 900 $" "17 800 " "40 800 $"
Sorting strings
# str_order() returns the indices that put a vector in sorted order; use them as below to reorder a vector or a whole dataset
car_data$car_name[str_order(car_data$car_name)]
## [1] "Audi A6" "Chery Tiggo 2 Pro"
## [3] "Chevrolet Cruze" "DongFeng EQ 220"
## [5] "Ford Fiesta" "Honda M-NV"
## [7] "Hyundai Elantra" "Hyundai Santa Fe"
## [9] "Hyundai Sonata" "Hyundai Venue"
## [11] "Kia Sorento" "Kia Sportage"
## [13] "LADA (VAZ) 2102" "LADA (VAZ) 2104"
## [15] "LADA (VAZ) 2107" "LADA (VAZ) 2107"
## [17] "LADA (VAZ) 21099" "Lexus GX 460"
## [19] "Mercedes CLS 350" "Mercedes E 200"
## [21] "Mercedes E 240" "Mercedes E 320"
## [23] "Mercedes G 500" "Mercedes Vito 111"
## [25] "Mini Cooper" "Mitsubishi Pajero"
## [27] "Mitsubishi Pajero Sport" "Mitsubishi Pajero Sport"
## [29] "Opel Zafira" "Opel Zafira"
## [31] "Peugeot 407" "Toyota Avalon"
## [33] "Toyota Land Cruiser" "Toyota Land Cruiser"
## [35] "Toyota Prado" "Volkswagen ID.4"
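The same idea extends from a single column to whole rows. The sketch below assumes a small made-up data frame named cars with the same column names as car_data (the name/price pairings are invented for illustration) and contrasts str_sort(), which returns sorted values, with str_order(), which returns positions.

```r
library(stringr)

# Small hypothetical sample in the same shape as car_data
cars <- data.frame(
  car_name  = c("Kia Sportage", "Audi A6", "Opel Zafira", "Ford Fiesta"),
  car_price = c("26 500 AZN", "33 800 AZN", "12 000 AZN", "14 300 AZN"),
  stringsAsFactors = FALSE
)

# str_sort() returns the sorted values themselves
sorted_names <- str_sort(cars$car_name)

# str_order() returns positions, so it can reorder whole rows
cars_sorted <- cars[str_order(cars$car_name), ]

# decreasing = TRUE reverses the order
cars_desc <- cars[str_order(cars$car_name, decreasing = TRUE), ]

print(cars_sorted)
```

Because the rows are reordered together, each price stays attached to its car name.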
String matching
price <- str_detect(car_data$car_price, pattern = "\\$") # TRUE where the pattern occurs in the string
price
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [13] TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
## [25] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
price_indicies <- str_which(car_data$car_price, pattern = "\\$") # Now let's get the indices of the matched strings
price_indicies
## [1] 4 8 13 14 15 19 21 24 27 34 36
car_data$car_price[price_indicies] # Let's extract the matched strings using those indices
## [1] "44 500 $" "31 500 $" "24 700 $" "43 900 $" "44 900 $" "42 900 $"
## [7] "40 500 $" "52 900 $" "49 500 $" "45 900 $" "40 800 $"
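The three matching verbs are closely related. The following is a minimal sketch, assuming a few price strings copied from the output above, showing that str_which() and str_subset() are shortcuts built on str_detect(), and that fixed() can be used to match "$" literally instead of escaping it.

```r
library(stringr)

# Sample prices copied from the output above
car_price <- c("61 100 AZN", "44 500 $", "34 500 AZN", "31 500 $")

is_usd <- str_detect(car_price, "\\$")

# str_which() gives the same positions as which(str_detect())
usd_idx <- str_which(car_price, "\\$")

# str_subset() gives the same values as x[str_detect(x)];
# fixed() treats the pattern as plain text, so "$" needs no escaping
usd <- str_subset(car_price, fixed("$"))

# negate = TRUE keeps the strings that do NOT match
azn <- str_subset(car_price, fixed("$"), negate = TRUE)

print(usd)
print(azn)
```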
Substring and Replace
Now we are going to extract substrings from our sample dataset. Suppose
we want the rows in which the car price is above 15000. For that, the
price column must first be converted from character to numeric. Before
doing so, here are a few functions for subsetting and extracting:
str_subset(car_data$car_name,"Kia")
## [1] "Kia Sorento" "Kia Sportage"
# Some tricks with the str_extract() function:
str <- c("G1:E001", "G2:E002", "G3:E003")
print(str)
## [1] "G1:E001" "G2:E002" "G3:E003"
str_extract(string = str, pattern = "E[0-9]+")
## [1] "E001" "E002" "E003"
str_remove(string = str, pattern = "^.*:") # Remove everything up to and including the colon
## [1] "E001" "E002" "E003"
Let's divide the car_price column into price and currency
gsub("[^0-9.-]", "", car_data$car_price)
## [1] "61100" "29500" "30900" "44500" "34500" "14300" "33800" "31500" "15000"
## [10] "12000" "2200" "26500" "24700" "43900" "44900" "16300" "35900" "8200"
## [19] "42900" "27800" "40500" "6400" "6000" "52900" "45000" "4500" "49500"
## [28] "38900" "10900" "9000" "11000" "29900" "9200" "45900" "17800" "40800"
gsub("[[:digit:]]", "", car_data$car_price)
##  [1] "  AZN" "  AZN" "  AZN" "  $"   "  AZN" "  AZN" "  AZN" "  $"   "  AZN"
## [10] "  AZN" "  AZN" "  AZN" "  $"   "  $"   "  $"   "  AZN" "  AZN" "  AZN"
## [19] "  $"   "  AZN" "  $"   "  AZN" "  AZN" "  $"   "  AZN" "  AZN" "  $"
## [28] "  AZN" "  AZN" "  AZN" "  AZN" "  AZN" "  AZN" "  $"   "  AZN" "  $"
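The two gsub() calls above can be combined into actual columns. The following is a minimal sketch, assuming a few price strings copied from the earlier output; the names prices and expensive are chosen here for illustration. It stores a numeric price and a currency label side by side, then keeps the rows priced above 15000.

```r
library(stringr)

# Sample prices copied from the earlier output
car_price <- c("61 100 AZN", "44 500 $", "14 300 AZN", "8 200 AZN")

prices <- data.frame(
  car_price = car_price,
  # strip everything except digits, then convert to numeric
  price     = as.numeric(str_remove_all(car_price, "[^0-9]")),
  # pull out the currency marker: either "AZN" or a literal "$"
  currency  = str_extract(car_price, "AZN|\\$"),
  stringsAsFactors = FALSE
)

# Rows whose numeric price is above 15000
# (this compares raw numbers and ignores the currency difference)
expensive <- prices[prices$price > 15000, ]

print(expensive)
```

Comparing AZN and $ amounts directly is a simplification; a real analysis would convert everything to one currency first.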
Replace function
Let's look at two very useful functions for replacing characters in a
string.
head(car_data$car_price)
## [1] "61 100 AZN" "29 500 AZN" "30 900 AZN" "44 500 $" "34 500 AZN"
## [6] "14 300 AZN"
car_data %>% select(car_price) %>%
  mutate(test = sub(pattern = "AZN", replacement = "Manat", x = car_price)) %>% head()
##    car_price         test
## 1 61 100 AZN 61 100 Manat
## 2 29 500 AZN 29 500 Manat
## 3 30 900 AZN 30 900 Manat
## 4   44 500 $     44 500 $
## 5 34 500 AZN 34 500 Manat
## 6 14 300 AZN 14 300 Manat
car_data %>% select(car_price) %>%
  mutate(test = str_replace(pattern = "\\$", replacement = "Dollar", string = car_price)) %>% head()
##    car_price          test
## 1 61 100 AZN    61 100 AZN
## 2 29 500 AZN    29 500 AZN
## 3 30 900 AZN    30 900 AZN
## 4   44 500 $ 44 500 Dollar
## 5 34 500 AZN    34 500 AZN
## 6 14 300 AZN    14 300 AZN
# Classwork: explain the difference between str_replace() and str_replace_all()
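As a starting point for this exercise, here is a small sketch on a single price string taken from the data above (the variable names are illustrative). A price contains two spaces, which makes the difference between the two functions visible.

```r
library(stringr)

price <- "61 100 AZN"

# str_replace() changes only the first match in each string
first_only <- str_replace(price, " ", "")

# str_replace_all() changes every match in each string
all_matches <- str_replace_all(price, " ", "")

print(first_only)
print(all_matches)
```

Base R's sub() and gsub() follow the same first-match versus all-matches split.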
Viewing Strings
# Returns the first 6 rows of the car_name column
car_data$car_name[1:6]
## [1] "Honda M-NV" "DongFeng EQ 220" "Chery Tiggo 2 Pro"
## [4] "Volkswagen ID.4" "Hyundai Elantra" "Ford Fiesta"
# Renders the strings as HTML with the matched pattern highlighted
#str_view(car_data$car_name,pattern="Kia")
Classwork:
- Practice with strings: https://r4ds.had.co.nz/strings.html
- A more powerful package: https://r4ds.had.co.nz/strings.html#stringi
- Clean the data in bina.az.csv (a dataset made by Mr. Cavad)
- Write some regular expression statements: Link here