DATA607 Tidyverse Recipe 1

The `stringr` package

Overview

The stringr package is a consistent, simple and easy to use set of wrappers that is built on top of the stringi package. The stringi package uses the ICU C library to provide fast, correct implementations of common string manipulations and provides a comprehensive set covering almost anything you can imagine, while stringr focusses on the most important and commonly used string manipulation functions.

All functions in stringr start with str_ and take a vector of strings as the first argument. One of the uses of this package that I enjoy a lot is in tidying up the output goten from webscrapping a webpage (using appropriate functions from the tidyvest package) to make the result presentable.

Assuming I want to get some top selling books in a particular category like Law from Amazon, I can easily call some of the stringr functions like:

1). str_replace_all: to replace selected string pattern by another string or an empty one. The pattern can be matched with rejex

2). str_trim: trims the beginning and end of a string

3). str_extract: extracts a given string pattern from the lot. The pattern can be matched with rejext

4). str_split: splits a string at a given character

to tidy up the result. Each of these listed functions perform functions just as their names indicate.

Load libraries, read web page

library(stringr)
library(rvest)

## Loading required package: xml2

library(DT)

html_raw <- html_nodes(read_html('https://www.amazon.com/Best-Sellers-Books-Law/zgbs/books/10777/ref=zg_bs_nav_b_1_b', encoding = "en_US.UTF-8"), 'ol#zg-ordered-list')

Tidy up the Result

processNodes <- function(rawNode, genre)
{
    imgs <- vector()
    genres <- vector()
    titles <- vector()
    authors <- vector()
    ratings <- vector()
    numRaters <- vector()
    types <- vector()
    prices <- vector()
    
    itemsRaw <-html_nodes(rawNode, 'span.zg-item')  # using html_nodes from rvest to extract the needed nodes
  
    for(i in 1: length(itemsRaw))
    {
        x <- itemsRaw[i]
        xImg <- str_replace_all(str_extract(x, '(?<=src=)"(.+?)"'), '\"', '') # matching the image element's source attribute using a rejex pattern
        
        xTitle <- str_trim(str_replace_all(html_text(html_node(x, '.a-link-normal > div')), 
                                           "[\r\n]" , ""), side='both') # combining str_trim, str_replace_all, html_text and html_node in one call to tidy a book's title
        
        if(str_detect(xTitle, ':') == TRUE)
        {
          xTitle <- str_split(xTitle, ':')[[1]][1] # Shorten the title by splitting at the colon (:)
        }
        
        xAuthor <- str_trim(html_text(html_node(x, '.a-link-child')), side='both')
        
        if(is.na(xAuthor) || is.null(xAuthor))
        {
          xAuthor <- str_trim(html_text(html_node(x, 'span.a-color-base')), side='both')
        }
        
        xRating <- str_trim(str_extract(html_node(x, '.a-icon-alt'), '\\d+(\\.\\d+)'), side='both')
        
        xNumRaters <- str_replace(str_trim(html_text(html_node(x, 'div.a-icon-row > a.a-size-small')), side='both'), ',', '')
        
        if(is.na(xNumRaters) || is.null(xNumRaters))
        {
          xNumRaters <- str_replace(str_trim(html_text(html_node(x, 
                        'div.a-spacing-none > a-link-normal')), side='both'), ',', '')
        }
       
        xType <- str_trim(html_text(html_node(x, 'div.a-size-small > span.a-color-secondary')), side='both')
        
        p <- str_extract(html_node(x, 'span.p13n-sc-price'), '\\d+(\\.\\d+)')
        
        xPrice <-  round(as.numeric(p), 2)
        
        if(is.na(xPrice) || is.null(xPrice))
        {
          xPrice <- str_extract(str_trim(html_text(html_node(x, 
                        'span.a-color-price')), side='both'), '[[:alpha:] ]{2,}')
        }
        
        imgs <- c(imgs, xImg)
        genres <- c(genres, genre)
        titles <- c(titles, xTitle)
        authors <- c(authors, xAuthor)
        ratings <- c(ratings, xRating)
        numRaters <- c(numRaters, xNumRaters)
        types <- c(types, xType)
        prices <- c(prices, xPrice)
    }
    
    return(data.frame(title = titles, author = authors, genre = genres, rating = ratings, img = imgs,
                      numRater = numRaters, type = types, price = prices))
}

This Returns a dataframe with the cleaned observations

law_books <- processNodes(html_raw, 'Law')

Show the output

law_books_filtered <- subset(law_books, select=c(title, author, genre, rating, numRater, type, price)) #Remove the book image url

datatable(law_books_filtered, colnames= c("Title", "Author", "Genre", "Rating", "Number of reviewers", "Type", "Price($)"), class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

Exporting the tidied data to a csv file for further analyses

write.csv(law_books_filtered, "top_selling_law_books.csv", row.names=FALSE)

Conclusion

The stringr package is very useful when it comes to cleaning data mined from different sources and manipulating already available data sets. This recipe has demonstrated the use of some of its functions in data cleaning and manipulation. However, this is not exhaustive.

References

stringr

DATA607 Tidyverse Recipe 1

Henry Otuadinma

May 3, 2019

The `stringr` package

Overview

Load libraries, read web page

Tidy up the Result

This Returns a dataframe with the cleaned observations

Show the output

Exporting the tidied data to a csv file for further analyses

Conclusion

References

DATA607 Tidyverse Recipe 1

Henry Otuadinma

May 3, 2019

The stringr package

Overview

Load libraries, read web page

Tidy up the Result

This Returns a dataframe with the cleaned observations

Show the output

Exporting the tidied data to a csv file for further analyses

Conclusion

References

The `stringr` package