I would like to scrape a beauty retailer’s website and capture product pricing data. I’m what you call a beauty junkie, someone who loves to look for and test new beauty products. It will be interesting to retrieve data from a beauty website that I can manipulate on my own. For instance, pricing data is useful for making price comparisons across retailer websites and examining price increases over time.
Sephora vs Ulta - it’s a rivalry that has been brewing for some time and is heating up.
They are the heavyweights in the specialty beauty retail industry. Ulta is the largest beauty retailer in the U.S., with net sales of $6.7 billion in 2018 and about 1,200 stores (Footnote 1). Sephora operates about 500 stores in the U.S. Its exact revenue numbers are unknown as it is privately held by the French conglomerate LVMH, but estimates range from $1 billion to $4 billion for the U.S. and about $6 billion globally (Footnote 2).
Sephora and Ulta started off in different channels three decades ago. Sephora operated in the prestige cosmetics segment, while Ulta offered drugstore and professional hair care. The beauty landscape has changed greatly since then, with the lines blurring between channels. Today, Sephora still offers high-end, luxury brands, while Ulta offers a mix of budget, mid-tier, and luxury brands (Footnote 3). Their product offerings have some overlap and they are converging upon the same customers (Footnote 2). For example, both are trying to tap the lucrative teen market, and Sephora is feeling the competition from Ulta’s highly successful store growth and marketing strategy (Footnote 4).
I’ve been observing these changes personally, as a longtime shopper at both stores. In the past three years, my spending dollars have shifted from Sephora to Ulta. Ulta keeps adding new brands and adopting successful marketing tactics to win my loyalty. Ulta’s marketing strategy is working on me! It would not surprise me if Ulta’s brand image catches up to Sephora’s in 5 years’ time.
I first attempted to scrape product data from Sephora. Sephora was problematic, as the CSS selector did not pick up some of the elements consistently. Ulta yielded clean data, so I picked Ulta to scrape for this project.
Scraping Ulta is a formidable task. Ulta carries more than 500 brands and 25,000 products (Footnotes 4, 5).
Since this is my first time scraping, I need to focus on a small slice to make this project manageable.
Many skincare products come in multiple sizes. This creates a problem when scraping price data. The prices for multiple sizes are shown as a range on a search results page.
Example of a product with a price range:
Brand: Clinique
Product Name: All About Eyes
Price Range: $33.00 - $54.50 <– 2 sizes (0.5 oz, 1 oz)
Select a category which you know comes in mostly one size. That way, you can avoid the price range issue for this project. These categories are good choices: Eye Treatments, Exfoliators, Masks, Lip Treatments.
Let’s pick Eye Treatments. The vast majority of eye treatments come in the standard size of 0.5 oz (15 ml). For the occasional eye treatment that is offered in multiple sizes, we’ll deal with them on a case by case basis (i.e. clean the data manually).
URL to be scraped: https://www.ulta.com/skin-care-eye-treatments?N=270k
Data hierarchy: Ulta.com / Skincare / Eye Treatments / p.1
Use R package rvest to scrape data from HTML web pages.
Use Selector Gadget Chrome extension to get the CSS selector for each desired element (brand, product name, price, number of reviews) from the web page.
Adapt some of the code chunks from IMDb Web Scraping Tutorial in Week 8: https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/
The price range is captured as text. Extra text needs to be removed, and the number portion converted to numerical.
The removal of the $ symbol keeps the lower price in a price range and truncates the higher price. For the sake of simplicity, and since there are only a few cases of this, let’s allow the lower price into the dataset, even if the size isn’t equivalent to all the other products being scraped.
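As a rough sketch of that idea (not the code used later in this post), a regular expression can pull every dollar amount out of a price string, so the lower price is kept and the ranges are flagged for a manual look. The sample strings and the names raw_prices, amounts, lower_price, and is_range are made up for illustration.

```r
# Hypothetical sample strings in the same shape as the cleaned-up prices
raw_prices <- c("$38.00", "$33.00 - $54.50", "$122.00")

# Find every amount that looks like digits, a dot, and two decimals
amounts <- regmatches(raw_prices, gregexpr("[0-9]+\\.[0-9]{2}", raw_prices))

# Keep the first (lower) amount of each string as a number
lower_price <- vapply(amounts, function(a) as.numeric(a[1]), numeric(1))

# Flag the strings that held more than one amount, i.e. a price range
is_range <- lengths(amounts) > 1

lower_price
is_range
```

Flagging the ranges makes it easy to find the handful of multi-size products later instead of hunting for them by eye.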
Ready to scrape and clean.
#Loading the rvest package
library('rvest')
## Loading required package: xml2
#Specifying the url for desired website to be scraped - Ulta.com / Skincare / Eye Treatments category / page 1
eyeTx_url <- 'https://www.ulta.com/skin-care-eye-treatments?N=270k'
#Reading the HTML code from the website
webpage <- read_html(eyeTx_url)
#Using CSS selectors to scrape the prices
price_html <- html_nodes(webpage,'.regPrice')
#Converting the price data to text
price_data <- html_text(price_html)
#Let's have a look at the price data for first 10 rows
head(price_data, n = 10)
## [1] "\r\n\t\t\t\t\t\t$38.00" "\r\n\t\t\t\t\t\t$69.00"
## [3] "\r\n\t\t\t\t\t\t$33.00 - $54.50" "\r\n\t\t\t\t\t\t$42.00"
## [5] "\r\n\t\t\t\t\t\t$70.00" "\r\n\t\t\t\t\t\t$33.00 - $54.50"
## [7] "\r\n\t\t\t\t\t\t$65.00" "\r\n\t\t\t\t\t\t$34.00"
## [9] "\r\n\t\t\t\t\t\t$30.00 - $48.00" "\r\n\t\t\t\t\t\t$62.00"
This is not bad. The prices are showing up. There are '\r', '\n', and '\t' characters preceding the prices. These can be removed. Some of the prices appear as a range.
#Data-Preprocessing: removing '\r', '\n', '\t' from prices
price_data<-gsub("\r","",price_data)
price_data<-gsub("\n","",price_data)
price_data<-gsub("\t","",price_data)
head(price_data, n = 10)
## [1] "$38.00" "$69.00" "$33.00 - $54.50"
## [4] "$42.00" "$70.00" "$33.00 - $54.50"
## [7] "$65.00" "$34.00" "$30.00 - $48.00"
## [10] "$62.00"
The '\r', '\n', and '\t' characters are removed.
#Data-Preprocessing: removing '$'
price_data<-gsub("$","",price_data)
head(price_data, n = 10)
## [1] "$38.00" "$69.00" "$33.00 - $54.50"
## [4] "$42.00" "$70.00" "$33.00 - $54.50"
## [7] "$65.00" "$34.00" "$30.00 - $48.00"
## [10] "$62.00"
This does not work. Why is ‘$’ not removed?
Maybe it’s not necessary. The IMDb Web Scraping tutorial did not seem to have a line of code for removing the ‘$’ from gross_data, but it was gone by the time the string was converted to numerical. Let’s try that shortly.
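A likely explanation, offered as a sketch rather than a definitive diagnosis: gsub() treats its first argument as a regular expression by default, and in a regular expression $ means "end of the string" rather than a literal dollar sign. Replacing the empty end-of-string position with nothing leaves the text unchanged. Two standard ways to ask for a literal '$' are fixed = TRUE and escaping the character. The sample vector below is hypothetical.

```r
# Hypothetical sample in the same shape as price_data
prices <- c("$38.00", "$33.00 - $54.50")

# "$" as a regex is the end-of-string anchor, so nothing visible changes
unchanged <- gsub("$", "", prices)

# fixed = TRUE treats the pattern as plain text
no_dollar_fixed <- gsub("$", "", prices, fixed = TRUE)

# Escaping the dollar sign works too
no_dollar_escaped <- gsub("\\$", "", prices)

unchanged
no_dollar_fixed
no_dollar_escaped
```

Note that a range still leaves " - " in the middle of the string, so it would not convert to a number without further handling.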
# But first, let's check the number of products that were successfully scraped.
length(price_data)
## [1] 67
Not quite. There are actually 96 products on this page, but only 67 prices came back, so some products did not get picked up.
I will have to look at the product list in the data frame and compare it with the web page to determine why.
#Data-Preprocessing: converting price_data to numerical
price_data<-as.numeric(price_data)
## Warning: NAs introduced by coercion
OK, so we have the dreaded “NAs introduced by coercion” warning that I encountered in the IMDb tutorial. Let’s see what’s going on.
head(price_data,n=96)
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [24] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [47] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
So, converting the price data to numerical did not work. The ‘$’ sign is probably the culprit.
Oh, I figured it out! Looking more closely at the IMDb tutorial, I found the line of code that removes the ‘$’. It is removed with the substring function.
Excerpt from a web search:
“The substring function in R can be used either to extract parts of character strings, or to change the values of parts of character strings”
Syntax for Substring function in R:
substr(text, start, stop)
substring(text, first, last = 1000000L)
First argument (text) is the string.
Second argument (start/first) is the start position of the substring.
Third argument (stop/last) is the end position of the substring.
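To make the syntax concrete, here is a small sketch on a single made-up string. Positions count from 1 and both ends are inclusive. The variable names are only for illustration.

```r
label <- "$38.00"

first_char <- substr(label, 1, 1)   # the character at position 1
amount     <- substr(label, 2, 6)   # positions 2 through 6
rest       <- substring(label, 2)   # from position 2 to the end of the string

# The same function can also change part of a string in place
relabelled <- label
substr(relabelled, 1, 1) <- "#"

first_char
amount
rest
relabelled
```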
I will incorporate the substring function later. But first, let me address the issue of the missing product.
Here is a modification to the earlier CSS selector code that will resolve the issue of the missing product.
Note the addition of .pro-old-price in the code below. There are a few products on sale. They have a sale price followed by a regular price that is crossed off. The scraper does not pick up the sale price without the inclusion of this second CSS selector (.pro-old-price).
#Using CSS selectors to scrape the prices - MODIFIED CODE w/ 2nd CSS selector for old price
price_html <- html_nodes(webpage,'.regPrice , .pro-old-price')
#Converting the price data to text
price_data <- html_text(price_html)
#Let's have a look at the price data for all the rows
head(price_data, n = 96)
## [1] "\r\n\t\t\t\t\t\t$38.00" "\r\n\t\t\t\t\t\t$69.00"
## [3] "\r\n\t\t\t\t\t\t$33.00 - $54.50" "\r\n\t\t\t\t\t\t$42.00"
## [5] " \r\n\t\t\t\t\t\t\t$32.00" "\r\n\t\t\t\t\t\t$70.00"
## [7] "\r\n\t\t\t\t\t\t$33.00 - $54.50" "\r\n\t\t\t\t\t\t$65.00"
## [9] " \r\n\t\t\t\t\t\t\t$30.00" "\r\n\t\t\t\t\t\t$34.00"
## [11] "\r\n\t\t\t\t\t\t$30.00 - $48.00" " \r\n\t\t\t\t\t\t\t$65.00"
## [13] " \r\n\t\t\t\t\t\t\t$28.00" "\r\n\t\t\t\t\t\t$62.00"
## [15] "\r\n\t\t\t\t\t\t$21.00" "\r\n\t\t\t\t\t\t$46.00"
## [17] "\r\n\t\t\t\t\t\t$16.00" "\r\n\t\t\t\t\t\t$32.00"
## [19] "\r\n\t\t\t\t\t\t$75.00" "\r\n\t\t\t\t\t\t$37.00"
## [21] "\r\n\t\t\t\t\t\t$75.00" "\r\n\t\t\t\t\t\t$85.00"
## [23] "\r\n\t\t\t\t\t\t$80.00" "\r\n\t\t\t\t\t\t$10.00"
## [25] "\r\n\t\t\t\t\t\t$22.99" " \r\n\t\t\t\t\t\t\t$60.00"
## [27] "\r\n\t\t\t\t\t\t$48.00" " \r\n\t\t\t\t\t\t\t$18.00"
## [29] "\r\n\t\t\t\t\t\t$52.00" "\r\n\t\t\t\t\t\t$69.00"
## [31] "\r\n\t\t\t\t\t\t$10.00" "\r\n\t\t\t\t\t\t$50.50"
## [33] "\r\n\t\t\t\t\t\t$85.00" "\r\n\t\t\t\t\t\t$38.00"
## [35] "\r\n\t\t\t\t\t\t$34.00" " \r\n\t\t\t\t\t\t\t$40.00"
## [37] "\r\n\t\t\t\t\t\t$69.00" "\r\n\t\t\t\t\t\t$24.99"
## [39] "\r\n\t\t\t\t\t\t$33.00" "\r\n\t\t\t\t\t\t$34.00"
## [41] "\r\n\t\t\t\t\t\t$62.00" "\r\n\t\t\t\t\t\t$67.00"
## [43] "\r\n\t\t\t\t\t\t$29.95" "\r\n\t\t\t\t\t\t$38.99"
## [45] "\r\n\t\t\t\t\t\t$82.00" "\r\n\t\t\t\t\t\t$65.00"
## [47] " \r\n\t\t\t\t\t\t\t$49.00" "\r\n\t\t\t\t\t\t$60.00"
## [49] "\r\n\t\t\t\t\t\t$22.99" "\r\n\t\t\t\t\t\t$15.99"
## [51] " \r\n\t\t\t\t\t\t\t$24.99" " \r\n\t\t\t\t\t\t\t$23.99"
## [53] "\r\n\t\t\t\t\t\t$59.00" "\r\n\t\t\t\t\t\t$24.99"
## [55] " \r\n\t\t\t\t\t\t\t$50.00" "\r\n\t\t\t\t\t\t$68.00"
## [57] " \r\n\t\t\t\t\t\t\t$69.00" "\r\n\t\t\t\t\t\t$28.00"
## [59] " \r\n\t\t\t\t\t\t\t$60.00" "\r\n\t\t\t\t\t\t$21.99"
## [61] "\r\n\t\t\t\t\t\t$39.00" "\r\n\t\t\t\t\t\t$65.00"
## [63] " \r\n\t\t\t\t\t\t\t$50.00" "\r\n\t\t\t\t\t\t$53.00"
## [65] " \r\n\t\t\t\t\t\t\t$69.00" " \r\n\t\t\t\t\t\t\t$52.00"
## [67] " \r\n\t\t\t\t\t\t\t$58.00" "\r\n\t\t\t\t\t\t$38.00"
## [69] " \r\n\t\t\t\t\t\t\t$75.00" " \r\n\t\t\t\t\t\t\t$80.00"
## [71] "\r\n\t\t\t\t\t\t$42.00" "\r\n\t\t\t\t\t\t$42.00"
## [73] " \r\n\t\t\t\t\t\t\t$35.00" " \r\n\t\t\t\t\t\t\t$20.00"
## [75] "\r\n\t\t\t\t\t\t$27.50" " \r\n\t\t\t\t\t\t\t$122.00"
## [77] " \r\n\t\t\t\t\t\t\t$48.00" " \r\n\t\t\t\t\t\t\t$110.00"
## [79] "\r\n\t\t\t\t\t\t$49.00" "\r\n\t\t\t\t\t\t$55.00"
## [81] " \r\n\t\t\t\t\t\t\t$23.99" " \r\n\t\t\t\t\t\t\t$15.00"
## [83] "\r\n\t\t\t\t\t\t$23.99" "\r\n\t\t\t\t\t\t$27.49"
## [85] "\r\n\t\t\t\t\t\t$29.99" "\r\n\t\t\t\t\t\t$12.49"
## [87] "\r\n\t\t\t\t\t\t$25.00" " \r\n\t\t\t\t\t\t\t$68.00"
## [89] "\r\n\t\t\t\t\t\t$18.95" " \r\n\t\t\t\t\t\t\t$15.00"
## [91] "\r\n\t\t\t\t\t\t$32.00" "\r\n\t\t\t\t\t\t$42.00"
## [93] "\r\n\t\t\t\t\t\t$19.00" "\r\n\t\t\t\t\t\t$64.00"
## [95] "\r\n\t\t\t\t\t\t$60.00" " \r\n\t\t\t\t\t\t\t$38.00"
Now, all 96 products have been picked up.
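The comma in the selector is what does the work here: in CSS, "A, B" matches elements that match either A or B. Below is a minimal sketch on a made-up HTML fragment. The class names .regPrice and .pro-old-price come from the code above, but the surrounding markup and the .pro-new-price class are assumptions, not Ulta's actual page structure.

```r
library(rvest)

# Hypothetical markup: one product on sale between two regular-price products
page <- read_html('<div class="product"><span class="regPrice"> $38.00</span></div>
<div class="product"><span class="pro-old-price"> $32.00</span><span class="pro-new-price"> $25.60</span></div>
<div class="product"><span class="regPrice"> $69.00</span></div>')

# One selector misses the product that is on sale
only_regular <- html_text(html_nodes(page, ".regPrice"))

# Two selectors joined by a comma pick up either class, in page order
either_class <- trimws(html_text(html_nodes(page, ".regPrice, .pro-old-price")))

length(only_regular)
either_class
```

Because the matches come back in page order, the prices stay lined up with the brand and product names scraped from the same page.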
Next, search Google for how to remove extra spaces.
Found this example:
sentenceString <- ' Dan is here. '
sentenceString = trimws(sentenceString)
sentenceString
The trimws function removes spaces for one string. But how about a vector of strings? Let’s try it.
#Data-Preprocessing: removing extra space before the price using trimws function
# Product #53 is the one on sale, from a visual examination of web page, and has an extra space preceding its price.
price_data = trimws(price_data)
head(price_data, n = 53)
## [1] "$38.00" "$69.00" "$33.00 - $54.50"
## [4] "$42.00" "$32.00" "$70.00"
## [7] "$33.00 - $54.50" "$65.00" "$30.00"
## [10] "$34.00" "$30.00 - $48.00" "$65.00"
## [13] "$28.00" "$62.00" "$21.00"
## [16] "$46.00" "$16.00" "$32.00"
## [19] "$75.00" "$37.00" "$75.00"
## [22] "$85.00" "$80.00" "$10.00"
## [25] "$22.99" "$60.00" "$48.00"
## [28] "$18.00" "$52.00" "$69.00"
## [31] "$10.00" "$50.50" "$85.00"
## [34] "$38.00" "$34.00" "$40.00"
## [37] "$69.00" "$24.99" "$33.00"
## [40] "$34.00" "$62.00" "$67.00"
## [43] "$29.95" "$38.99" "$82.00"
## [46] "$65.00" "$49.00" "$60.00"
## [49] "$22.99" "$15.99" "$24.99"
## [52] "$23.99" "$59.00"
# NOTE: The site is dynamic and changes all the time. Having returned to this project a week later, Product #53 is no longer the product with the extra space.
Whew! It worked. The extra space in Product #53 is gone. Moving on to more cleaning.
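For what it's worth, trimws() is vectorized, so it handles a whole vector of strings in one call. Its default whitespace set also covers spaces, tabs, carriage returns, and newlines at both ends of each string, which fits the output above where the '\r', '\n', and '\t' characters are already gone. A small sketch on made-up strings:

```r
# Hypothetical strings shaped like the raw scraped prices
raw <- c("\r\n\t\t\t$38.00", " \r\n\t\t\t\t$32.00", "\r\n\t\t$33.00 - $54.50")

# Leading and trailing whitespace is removed from every element;
# spaces inside a string (as in the price range) are left alone
clean <- trimws(raw)

clean
```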
#Data-Preprocessing: removing '\r', '\n', '\t' from prices
price_data<-gsub("\r","",price_data)
price_data<-gsub("\n","",price_data)
price_data<-gsub("\t","",price_data)
head(price_data, n = 96)
## [1] "$38.00" "$69.00" "$33.00 - $54.50"
## [4] "$42.00" "$32.00" "$70.00"
## [7] "$33.00 - $54.50" "$65.00" "$30.00"
## [10] "$34.00" "$30.00 - $48.00" "$65.00"
## [13] "$28.00" "$62.00" "$21.00"
## [16] "$46.00" "$16.00" "$32.00"
## [19] "$75.00" "$37.00" "$75.00"
## [22] "$85.00" "$80.00" "$10.00"
## [25] "$22.99" "$60.00" "$48.00"
## [28] "$18.00" "$52.00" "$69.00"
## [31] "$10.00" "$50.50" "$85.00"
## [34] "$38.00" "$34.00" "$40.00"
## [37] "$69.00" "$24.99" "$33.00"
## [40] "$34.00" "$62.00" "$67.00"
## [43] "$29.95" "$38.99" "$82.00"
## [46] "$65.00" "$49.00" "$60.00"
## [49] "$22.99" "$15.99" "$24.99"
## [52] "$23.99" "$59.00" "$24.99"
## [55] "$50.00" "$68.00" "$69.00"
## [58] "$28.00" "$60.00" "$21.99"
## [61] "$39.00" "$65.00" "$50.00"
## [64] "$53.00" "$69.00" "$52.00"
## [67] "$58.00" "$38.00" "$75.00"
## [70] "$80.00" "$42.00" "$42.00"
## [73] "$35.00" "$20.00" "$27.50"
## [76] "$122.00" "$48.00" "$110.00"
## [79] "$49.00" "$55.00" "$23.99"
## [82] "$15.00" "$23.99" "$27.49"
## [85] "$29.99" "$12.49" "$25.00"
## [88] "$68.00" "$18.95" "$15.00"
## [91] "$32.00" "$42.00" "$19.00"
## [94] "$64.00" "$60.00" "$38.00"
Fine so far.
#Data-Preprocessing: removing '$' from prices
price_data<-substring(price_data,2,7)
head(price_data, n = 96)
## [1] "38.00" "69.00" "33.00 " "42.00" "32.00" "70.00" "33.00 "
## [8] "65.00" "30.00" "34.00" "30.00 " "65.00" "28.00" "62.00"
## [15] "21.00" "46.00" "16.00" "32.00" "75.00" "37.00" "75.00"
## [22] "85.00" "80.00" "10.00" "22.99" "60.00" "48.00" "18.00"
## [29] "52.00" "69.00" "10.00" "50.50" "85.00" "38.00" "34.00"
## [36] "40.00" "69.00" "24.99" "33.00" "34.00" "62.00" "67.00"
## [43] "29.95" "38.99" "82.00" "65.00" "49.00" "60.00" "22.99"
## [50] "15.99" "24.99" "23.99" "59.00" "24.99" "50.00" "68.00"
## [57] "69.00" "28.00" "60.00" "21.99" "39.00" "65.00" "50.00"
## [64] "53.00" "69.00" "52.00" "58.00" "38.00" "75.00" "80.00"
## [71] "42.00" "42.00" "35.00" "20.00" "27.50" "122.00" "48.00"
## [78] "110.00" "49.00" "55.00" "23.99" "15.00" "23.99" "27.49"
## [85] "29.99" "12.49" "25.00" "68.00" "18.95" "15.00" "32.00"
## [92] "42.00" "19.00" "64.00" "60.00" "38.00"
The substring function worked after playing around with the last argument (stop position). Initially tried it with 6, but it lopped off the first number after ‘$’. Not sure why 6 doesn’t work, but 7 does. There are only 5 characters after ‘$’.
F/U - Ask classmate Steve D. why. He understands this syntax.
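One plausible answer, sketched on made-up strings: the '$' itself sits at position 1, and the stop position is inclusive. A two-digit price such as "$38.00" ends at position 6, but a three-digit price such as "$122.00" ends at position 7, so a stop of 6 cuts off its last character. For a price range, a stop of 7 picks up the space after the lower price, which as.numeric() tolerates.

```r
prices <- c("$38.00", "$122.00", "$33.00 - $54.50")

stop_at_6 <- substring(prices, 2, 6)
stop_at_7 <- substring(prices, 2, 7)

stop_at_6
stop_at_7

# Trailing whitespace does not prevent the conversion
as_numbers <- as.numeric(stop_at_7)
as_numbers
```

By the same logic, a price of $1,000 or more would need a different approach, since it would not fit in positions 2 through 7.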
Now let’s convert this string price data into numerical.
#Data-Preprocessing: converting price_data to numerical
price_data<-as.numeric(price_data)
head(price_data, n = 96)
## [1] 38.00 69.00 33.00 42.00 32.00 70.00 33.00 65.00 30.00 34.00
## [11] 30.00 65.00 28.00 62.00 21.00 46.00 16.00 32.00 75.00 37.00
## [21] 75.00 85.00 80.00 10.00 22.99 60.00 48.00 18.00 52.00 69.00
## [31] 10.00 50.50 85.00 38.00 34.00 40.00 69.00 24.99 33.00 34.00
## [41] 62.00 67.00 29.95 38.99 82.00 65.00 49.00 60.00 22.99 15.99
## [51] 24.99 23.99 59.00 24.99 50.00 68.00 69.00 28.00 60.00 21.99
## [61] 39.00 65.00 50.00 53.00 69.00 52.00 58.00 38.00 75.00 80.00
## [71] 42.00 42.00 35.00 20.00 27.50 122.00 48.00 110.00 49.00 55.00
## [81] 23.99 15.00 23.99 27.49 29.99 12.49 25.00 68.00 18.95 15.00
## [91] 32.00 42.00 19.00 64.00 60.00 38.00
str(price_data)
## num [1:96] 38 69 33 42 32 70 33 65 30 34 ...
A successful conversion. So far so good.
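Since the product count and the NA values both caused trouble earlier, a small check after the conversion could catch either problem in one place. This is only a sketch: check_prices is a hypothetical helper, and the demo vector is made up.

```r
# Hypothetical helper: summarise whether a converted price vector looks sane
check_prices <- function(prices, expected_n) {
  list(
    n_ok      = length(prices) == expected_n,  # did every product come through?
    n_missing = sum(is.na(prices)),            # how many failed to convert?
    range     = range(prices, na.rm = TRUE)    # lowest and highest price
  )
}

# Made-up data with one failed conversion
demo <- c(38, 69, 33, NA, 122)
result <- check_prices(demo, expected_n = 5)
result
```

For this page the call would be check_prices(price_data, expected_n = 96).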
Now, let’s grab the brand name, product name, and number of ratings before we create the data frame.
#Using CSS selectors to scrape the brand names
brand_html <- html_nodes(webpage,'.prod-title')
#Converting the brand data to text
brand_data <- html_text(brand_html)
#Let's have a look at the brand names
head(brand_data, n=96)
## [1] "\n\t\t\t\n\t\t\t\tIt Cosmetics\n\t\t"
## [2] "\n\t\t\t\n\t\t\t\tEstée Lauder\n\t\t"
## [3] "\n\t\t\t\n\t\t\t\tClinique\n\t\t"
## [4] "\n\t\t\t\n\t\t\t\tDr. Brandt\n\t\t"
## [5] "\n\t\t\t\n\t\t\t\tSkyn Iceland\n\t\t"
## [6] "\n\t\t\t\n\t\t\t\tShiseido\n\t\t"
## [7] "\n\t\t\t\n\t\t\t\tClinique\n\t\t"
## [8] "\n\t\t\t\n\t\t\t\tDermalogica\n\t\t"
## [9] "\n\t\t\t\n\t\t\t\tTula\n\t\t"
## [10] "\n\t\t\t\n\t\t\t\tflorence by mills\n\t\t"
## [11] "\n\t\t\t\n\t\t\t\tKiehl's Since 1851\n\t\t"
## [12] "\n\t\t\t\n\t\t\t\tStriVectin\n\t\t"
## [13] "\n\t\t\t\n\t\t\t\tTula\n\t\t"
## [14] "\n\t\t\t\n\t\t\t\tAhava\n\t\t"
## [15] "\n\t\t\t\n\t\t\t\tDerma E\n\t\t"
## [16] "\n\t\t\t\n\t\t\t\tClinique\n\t\t"
## [17] "\n\t\t\t\n\t\t\t\tflorence by mills\n\t\t"
## [18] "\n\t\t\t\n\t\t\t\tOrigins\n\t\t"
## [19] "\n\t\t\t\n\t\t\t\tLancôme\n\t\t"
## [20] "\n\t\t\t\n\t\t\t\tKiehl's Since 1851\n\t\t"
## [21] "\n\t\t\t\n\t\t\t\tPeter Thomas Roth\n\t\t"
## [22] "\n\t\t\t\n\t\t\t\tKate Somerville\n\t\t"
## [23] "\n\t\t\t\n\t\t\t\tDermalogica\n\t\t"
## [24] "\n\t\t\t\n\t\t\t\tEarth Therapeutics\n\t\t"
## [25] "\n\t\t\t\n\t\t\t\tNo7\n\t\t"
## [26] "\n\t\t\t\n\t\t\t\tPhilosophy\n\t\t"
## [27] "\n\t\t\t\n\t\t\t\tBareMinerals\n\t\t"
## [28] "\n\t\t\t\n\t\t\t\tMario Badescu\n\t\t"
## [29] "\n\t\t\t\n\t\t\t\tPeter Thomas Roth\n\t\t"
## [30] "\n\t\t\t\n\t\t\t\tDermalogica\n\t\t"
## [31] "\n\t\t\t\n\t\t\t\tEarth Therapeutics\n\t\t"
## [32] "\n\t\t\t\n\t\t\t\tClinique\n\t\t"
## [33] "\n\t\t\t\n\t\t\t\tClarins\n\t\t"
## [34] "\n\t\t\t\n\t\t\t\tTarte\n\t\t"
## [35] "\n\t\t\t\n\t\t\t\tClinique\n\t\t"
## [36] "\n\t\t\t\n\t\t\t\tSkyn Iceland\n\t\t"
## [37] "\n\t\t\t\n\t\t\t\tLancôme\n\t\t"
## [38] "\n\t\t\t\n\t\t\t\tRoC\n\t\t"
## [39] "\n\t\t\t\n\t\t\t\tClinique\n\t\t"
## [40] "\n\t\t\t\n\t\t\t\tBenefit Cosmetics\n\t\t"
## [41] "\n\t\t\t\n\t\t\t\tEstée Lauder\n\t\t"
## [42] "\n\t\t\t\n\t\t\t\tLancôme\n\t\t"
## [43] "\n\t\t\t\n\t\t\t\tDerma E\n\t\t"
## [44] "\n\t\t\t\n\t\t\t\tOlay\n\t\t"
## [45] "\n\t\t\t\n\t\t\t\tMurad\n\t\t"
## [46] "\n\t\t\t\n\t\t\t\tSUNDAY RILEY\n\t\t"
## [47] "\n\t\t\t\n\t\t\t\tStriVectin\n\t\t"
## [48] "\n\t\t\t\n\t\t\t\tElizabeth Arden\n\t\t"
## [49] "\n\t\t\t\n\t\t\t\tNo7\n\t\t"
## [50] "\n\t\t\t\n\t\t\t\tCeraVe\n\t\t"
## [51] "\n\t\t\t\n\t\t\t\tMad Hippie\n\t\t"
## [52] "\n\t\t\t\n\t\t\t\tNeutrogena\n\t\t"
## [53] "\n\t\t\t\n\t\t\t\tDermalogica\n\t\t"
## [54] "\n\t\t\t\n\t\t\t\tDerma E\n\t\t"
## [55] "\n\t\t\t\n\t\t\t\tPatchology\n\t\t"
## [56] "\n\t\t\t\n\t\t\t\tShiseido\n\t\t"
## [57] "\n\t\t\t\n\t\t\t\tStriVectin\n\t\t"
## [58] "\n\t\t\t\n\t\t\t\tPEACH & LILY\n\t\t"
## [59] "\n\t\t\t\n\t\t\t\tPatchology\n\t\t"
## [60] "\n\t\t\t\n\t\t\t\tNo7\n\t\t"
## [61] "\n\t\t\t\n\t\t\t\tKiehl's Since 1851\n\t\t"
## [62] "\n\t\t\t\n\t\t\t\tLancôme\n\t\t"
## [63] "\n\t\t\t\n\t\t\t\tJuice Beauty\n\t\t"
## [64] "\n\t\t\t\n\t\t\t\tDermalogica\n\t\t"
## [65] "\n\t\t\t\n\t\t\t\tStriVectin\n\t\t"
## [66] "\n\t\t\t\n\t\t\t\tTula\n\t\t"
## [67] "\n\t\t\t\n\t\t\t\tTula\n\t\t"
## [68] "\n\t\t\t\n\t\t\t\tPeter Thomas Roth\n\t\t"
## [69] "\n\t\t\t\n\t\t\t\tStriVectin\n\t\t"
## [70] "\n\t\t\t\n\t\t\t\tExuviance\n\t\t"
## [71] "\n\t\t\t\n\t\t\t\tClinique\n\t\t"
## [72] "\n\t\t\t\n\t\t\t\tFirst Aid Beauty\n\t\t"
## [73] "\n\t\t\t\n\t\t\t\tDHC\n\t\t"
## [74] "\n\t\t\t\n\t\t\t\tMario Badescu\n\t\t"
## [75] "\n\t\t\t\n\t\t\t\tClinique\n\t\t"
## [76] "\n\t\t\t\n\t\t\t\tPerricone MD\n\t\t"
## [77] "\n\t\t\t\n\t\t\t\tSkyn Iceland\n\t\t"
## [78] "\n\t\t\t\n\t\t\t\tPerricone MD\n\t\t"
## [79] "\n\t\t\t\n\t\t\t\tELEMIS\n\t\t"
## [80] "\n\t\t\t\n\t\t\t\tPeter Thomas Roth\n\t\t"
## [81] "\n\t\t\t\n\t\t\t\tNeutrogena\n\t\t"
## [82] "\n\t\t\t\n\t\t\t\tPatchology\n\t\t"
## [83] "\n\t\t\t\n\t\t\t\tNo7\n\t\t"
## [84] "\n\t\t\t\n\t\t\t\tRoC\n\t\t"
## [85] "\n\t\t\t\n\t\t\t\tOlay\n\t\t"
## [86] "\n\t\t\t\n\t\t\t\tOlay\n\t\t"
## [87] "\n\t\t\t\n\t\t\t\tSoap & Glory\n\t\t"
## [88] "\n\t\t\t\n\t\t\t\tPhilosophy\n\t\t"
## [89] "\n\t\t\t\n\t\t\t\tDerma E\n\t\t"
## [90] "\n\t\t\t\n\t\t\t\tStriVectin\n\t\t"
## [91] "\n\t\t\t\n\t\t\t\tBareMinerals\n\t\t"
## [92] "\n\t\t\t\n\t\t\t\tOrigins\n\t\t"
## [93] "\n\t\t\t\n\t\t\t\tTarte\n\t\t"
## [94] "\n\t\t\t\n\t\t\t\tClarins\n\t\t"
## [95] "\n\t\t\t\n\t\t\t\tKate Somerville\n\t\t"
## [96] "\n\t\t\t\n\t\t\t\tPhilosophy\n\t\t"
Ok good, the brand names are being picked up. But they need to be cleaned up too.
#Data-Preprocessing: removing '\n', '\t' from brand names
brand_data<-gsub("\n","",brand_data)
brand_data<-gsub("\t","",brand_data)
head(brand_data, n = 96)
## [1] "It Cosmetics" "Estée Lauder" "Clinique"
## [4] "Dr. Brandt" "Skyn Iceland" "Shiseido"
## [7] "Clinique" "Dermalogica" "Tula"
## [10] "florence by mills" "Kiehl's Since 1851" "StriVectin"
## [13] "Tula" "Ahava" "Derma E"
## [16] "Clinique" "florence by mills" "Origins"
## [19] "Lancôme" "Kiehl's Since 1851" "Peter Thomas Roth"
## [22] "Kate Somerville" "Dermalogica" "Earth Therapeutics"
## [25] "No7" "Philosophy" "BareMinerals"
## [28] "Mario Badescu" "Peter Thomas Roth" "Dermalogica"
## [31] "Earth Therapeutics" "Clinique" "Clarins"
## [34] "Tarte" "Clinique" "Skyn Iceland"
## [37] "Lancôme" "RoC" "Clinique"
## [40] "Benefit Cosmetics" "Estée Lauder" "Lancôme"
## [43] "Derma E" "Olay" "Murad"
## [46] "SUNDAY RILEY" "StriVectin" "Elizabeth Arden"
## [49] "No7" "CeraVe" "Mad Hippie"
## [52] "Neutrogena" "Dermalogica" "Derma E"
## [55] "Patchology" "Shiseido" "StriVectin"
## [58] "PEACH & LILY" "Patchology" "No7"
## [61] "Kiehl's Since 1851" "Lancôme" "Juice Beauty"
## [64] "Dermalogica" "StriVectin" "Tula"
## [67] "Tula" "Peter Thomas Roth" "StriVectin"
## [70] "Exuviance" "Clinique" "First Aid Beauty"
## [73] "DHC" "Mario Badescu" "Clinique"
## [76] "Perricone MD" "Skyn Iceland" "Perricone MD"
## [79] "ELEMIS" "Peter Thomas Roth" "Neutrogena"
## [82] "Patchology" "No7" "RoC"
## [85] "Olay" "Olay" "Soap & Glory"
## [88] "Philosophy" "Derma E" "StriVectin"
## [91] "BareMinerals" "Origins" "Tarte"
## [94] "Clarins" "Kate Somerville" "Philosophy"
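The same three steps (select the nodes, convert to text, strip the whitespace) repeat for every field, so they could be wrapped in one small function. The sketch below uses a hypothetical helper called scrape_text and a made-up HTML fragment. Only the class names .prod-title and .prod-desc come from this post.

```r
library(rvest)

# Hypothetical helper: select nodes, extract their text, trim the edges
scrape_text <- function(page, css) {
  trimws(html_text(html_nodes(page, css)))
}

# Made-up fragment with the same kind of padding seen in the scraped output
page <- read_html('<div><h4 class="prod-title">\n\t\t\tClinique\n\t\t</h4><p class="prod-desc">\n\t\t\tAll About Eyes\n\t\t</p></div>')

brand   <- scrape_text(page, ".prod-title")
product <- scrape_text(page, ".prod-desc")

brand
product
```

One difference from the gsub() approach: trimws() only touches the start and end of each string, so any tab or newline inside a name would be left in place.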
#Using CSS selectors to scrape the product names
product_html <- html_nodes(webpage,'.prod-desc')
#Converting the product name data to text
product_data <- html_text(product_html)
#Let's have a look at the product names
head(product_data, n=96)
## [1] "\n\t\t\t\n\t\t\t\tConfidence in an Eye Cream\n\t\t"
## [2] "\n\t\t\t\n\t\t\t\tAdvanced Night Repair Eye Concentrate Matrix\n\t\t"
## [3] "\n\t\t\t\n\t\t\t\tAll About Eyes\n\t\t"
## [4] "\n\t\t\t\n\t\t\t\tNeedles No More No More Baggage\n\t\t"
## [5] "\n\t\t\t\n\t\t\t\tHydro Cool Firming Eye Gels\n\t\t"
## [6] "\n\t\t\t\n\t\t\t\tBenefiance Wrinkle Smoothing Eye Cream\n\t\t"
## [7] "\n\t\t\t\n\t\t\t\tAll About Eyes Rich Eye Cream\n\t\t"
## [8] "\n\t\t\t\n\t\t\t\tAge Smart MultiVitamin Power Firm\n\t\t"
## [9] "\n\t\t\t\n\t\t\t\tRose Glow Cooling and Brightening Eye Balm\n\t\t"
## [10] "\n\t\t\t\n\t\t\t\tSwimming Under the Eyes Gel Pads\n\t\t"
## [11] "\n\t\t\t\n\t\t\t\tCreamy Eye Treatment with Avocado\n\t\t"
## [12] "\n\t\t\t\n\t\t\t\tHyaluronic Tripeptide Gel-Cream for Eyes\n\t\t"
## [13] "\n\t\t\t\n\t\t\t\tGlow & Get It Cooling & Brightening Eye Balm\n\t\t"
## [14] "\n\t\t\t\n\t\t\t\tExtreme Firming Eye Cream\n\t\t"
## [15] "\n\t\t\t\n\t\t\t\tOnline Only Advanced Peptides & Collagen Moisturizer\n\t\t"
## [16] "\n\t\t\t\n\t\t\t\tRepairwear Anti-Gravity Eye Cream\n\t\t"
## [17] "\n\t\t\t\n\t\t\t\tLook Alive Eye Balm\n\t\t"
## [18] "\n\t\t\t\n\t\t\t\tGinZing Refreshing Eye Cream to Brighten and Depuff\n\t\t"
## [19] "\n\t\t\t\n\t\t\t\tRénergie Lift Multi-Action Lifting And Firming Eye Cream\n\t\t"
## [20] "\n\t\t\t\n\t\t\t\tMidnight Recovery Eye\n\t\t"
## [21] "\n\t\t\t\n\t\t\t\t24K Gold Pure Luxury Lift & Firm Hydra-Gel Eye Patches\n\t\t"
## [22] "\n\t\t\t\n\t\t\t\t+Retinol Firming Eye Cream\n\t\t"
## [23] "\n\t\t\t\n\t\t\t\tAge Smart Reversal Eye Complex\n\t\t"
## [24] "\n\t\t\t\n\t\t\t\tHydrogel Under-Eye Recovery Patch\n\t\t"
## [25] "\n\t\t\t\n\t\t\t\tLift & Luminate Triple Action Eye Cream\n\t\t"
## [26] "\n\t\t\t\n\t\t\t\tAnti-Wrinkle Miracle Worker+ Line Correcting Eye Cream\n\t\t"
## [27] "\n\t\t\t\n\t\t\t\tAgeless Genius Firming & Wrinkle Smoothing Eye Cream\n\t\t"
## [28] "\n\t\t\t\n\t\t\t\tHyaluronic Eye Cream\n\t\t"
## [29] "\n\t\t\t\n\t\t\t\tCucumber De-Tox Hydra-Gel Eye Patches\n\t\t"
## [30] "\n\t\t\t\n\t\t\t\tStress Positive Eye Lift\n\t\t"
## [31] "\n\t\t\t\n\t\t\t\tHydrogel Collagen Under Eye Patch\n\t\t"
## [32] "\n\t\t\t\n\t\t\t\tClinique Smart Custom-Repair Eye Treatment\n\t\t"
## [33] "\n\t\t\t\n\t\t\t\tSuper Restorative Total Eye Concentrate\n\t\t"
## [34] "\n\t\t\t\n\t\t\t\tMaracuja C Brighter Eye Treatment\n\t\t"
## [35] "\n\t\t\t\n\t\t\t\tMoisture Surge Eye 96-Hour Hydro Filler Concentrate\n\t\t"
## [36] "\n\t\t\t\n\t\t\t\tBrightening Eye Serum\n\t\t"
## [37] "\n\t\t\t\n\t\t\t\tAdvanced Génifique Yeux Light-Pearl Eye & Lash Concentrate\n\t\t"
## [38] "\n\t\t\t\n\t\t\t\tEye Cream\n\t\t"
## [39] "\n\t\t\t\n\t\t\t\tAll About Eyes Serum De-Puffing Eye Massage\n\t\t"
## [40] "\n\t\t\t\n\t\t\t\tIt's Potent Eye Cream\n\t\t"
## [41] "\n\t\t\t\n\t\t\t\tAdvanced Night Repair Eye Supercharged Complex Synchronized Recovery\n\t\t"
## [42] "\n\t\t\t\n\t\t\t\tGénifique Yeux Youth Activating Eye Cream\n\t\t"
## [43] "\n\t\t\t\n\t\t\t\tOnline Only Advanced Peptides & Collagen Eye Cream\n\t\t"
## [44] "\n\t\t\t\n\t\t\t\tRegenerist Retinol 24 Night Eye Cream\n\t\t"
## [45] "\n\t\t\t\n\t\t\t\tRenewing Eye Cream\n\t\t"
## [46] "\n\t\t\t\n\t\t\t\tAuto Correct Eye Cream\n\t\t"
## [47] "\n\t\t\t\n\t\t\t\tHyperlift Eye Instant Eye Fix For Bags, Lines & Crepiness\n\t\t"
## [48] "\n\t\t\t\n\t\t\t\tOnline Only ADVANCED Ceramide Capsules Daily Youth Restoring Eye Serum\n\t\t"
## [49] "\n\t\t\t\n\t\t\t\tRestore & Renew Multi Action Eye Cream\n\t\t"
## [50] "\n\t\t\t\n\t\t\t\tEye Repair Cream\n\t\t"
## [51] "\n\t\t\t\n\t\t\t\tEye Cream\n\t\t"
## [52] "\n\t\t\t\n\t\t\t\tRapid Wrinkle Repair Eye Cream\n\t\t"
## [53] "\n\t\t\t\n\t\t\t\tIntensive Eye Repair\n\t\t"
## [54] "\n\t\t\t\n\t\t\t\tFirming DMAE Eye Lift Cream\n\t\t"
## [55] "\n\t\t\t\n\t\t\t\tOnline Only FlashPatch Rejuvenating Eye Gels\n\t\t"
## [56] "\n\t\t\t\n\t\t\t\tBenefiance WrinkleResist24 Pure Retinol Express Smoothing Eye Mask\n\t\t"
## [57] "\n\t\t\t\n\t\t\t\t360° Tightening Eye Serum\n\t\t"
## [58] "\n\t\t\t\n\t\t\t\tCold Brew Eye Recovery Stick\n\t\t"
## [59] "\n\t\t\t\n\t\t\t\tOnline Only FlashPatch Restoring Night Eye Gels\n\t\t"
## [60] "\n\t\t\t\n\t\t\t\tProtect & Perfect Intense Advanced Eye Cream\n\t\t"
## [61] "\n\t\t\t\n\t\t\t\tYouth Dose Eye Treatment\n\t\t"
## [62] "\n\t\t\t\n\t\t\t\tVisionnaire Eye Cream Advanced Multi-Correcting Eye Balm\n\t\t"
## [63] "\n\t\t\t\n\t\t\t\tSTEM CELLULAR Anti-Wrinkle Eye Treatment\n\t\t"
## [64] "\n\t\t\t\n\t\t\t\tTotal Eye Care SPF15\n\t\t"
## [65] "\n\t\t\t\n\t\t\t\tIntensive Eye Concentrate for Wrinkles\n\t\t"
## [66] "\n\t\t\t\n\t\t\t\tRevive & Rewind Revitalizing Eye Cream\n\t\t"
## [67] "\n\t\t\t\n\t\t\t\tMulti-Spectrum Eye Renewal Serum\n\t\t"
## [68] "\n\t\t\t\n\t\t\t\tInstant FIRMx Eye\n\t\t"
## [69] "\n\t\t\t\n\t\t\t\tAdvanced Retinol Eye Cream\n\t\t"
## [70] "\n\t\t\t\n\t\t\t\tAge Reverse Eye Contour\n\t\t"
## [71] "\n\t\t\t\n\t\t\t\tEven Better Eyes Dark Circle Corrector\n\t\t"
## [72] "\n\t\t\t\n\t\t\t\tFAB Skin Lab Retinol Eye Cream with Triple Hyaluronic Acid\n\t\t"
## [73] "\n\t\t\t\n\t\t\t\tOnline Only Concentrated Eye Cream\n\t\t"
## [74] "\n\t\t\t\n\t\t\t\tGlycolic Eye Cream\n\t\t"
## [75] "\n\t\t\t\n\t\t\t\tPep Start Eye Cream\n\t\t"
## [76] "\n\t\t\t\n\t\t\t\tEssential Fx Acyl-Glutathione Eyelid Lift Serum\n\t\t"
## [77] "\n\t\t\t\n\t\t\t\tIcelandic Relief Eye Cream with Glacial Flower Extract\n\t\t"
## [78] "\n\t\t\t\n\t\t\t\tCold Plasma+ Eye\n\t\t"
## [79] "\n\t\t\t\n\t\t\t\tPeptide 4 Eye Recovery Cream\n\t\t"
## [80] "\n\t\t\t\n\t\t\t\tWater Drench Hyaluronic Cloud Hydra-Gel Eye Patches\n\t\t"
## [81] "\n\t\t\t\n\t\t\t\tHydro Boost Eye Gel-Cream\n\t\t"
## [82] "\n\t\t\t\n\t\t\t\tOnline Only Travel Size FlashPatch Rejuvenating Eye Gels\n\t\t"
## [83] "\n\t\t\t\n\t\t\t\tYouthful Eye Serum\n\t\t"
## [84] "\n\t\t\t\n\t\t\t\tMulti-Correxion 5-in-1 Eye Cream\n\t\t"
## [85] "\n\t\t\t\n\t\t\t\tEyes Ultimate Eye Cream\n\t\t"
## [86] "\n\t\t\t\n\t\t\t\tTotal Effects Anti-Aging Eye Treatment Eye Transforming Cream\n\t\t"
## [87] "\n\t\t\t\n\t\t\t\tOnline Only Glow All Out Gift Set\n\t\t"
## [88] "\n\t\t\t\n\t\t\t\tUltimate Miracle Worker Fix Eye Power-Treatment Fill & Firm\n\t\t"
## [89] "\n\t\t\t\n\t\t\t\tNo-Dark-Circle Perfecting Eye Cream\n\t\t"
## [90] "\n\t\t\t\n\t\t\t\tIntensive Eye Concentrate for Wrinkles\n\t\t"
## [91] "\n\t\t\t\n\t\t\t\tSKINLONGEVITY Vital Power Eye Gel Cream\n\t\t"
## [92] "\n\t\t\t\n\t\t\t\tEye Doctor Moisture Care for Skin Around Eyes\n\t\t"
## [93] "\n\t\t\t\n\t\t\t\tTravel Size Maracuja C-Brighter Eye Treatment\n\t\t"
## [94] "\n\t\t\t\n\t\t\t\tExtra-Firming Eye\n\t\t"
## [95] "\n\t\t\t\n\t\t\t\tWrinkle Warrior Eye Visible Dark Circle Eraser\n\t\t"
## [96] "\n\t\t\t\n\t\t\t\tHope In A Tube\n\t\t"
The product names look good too. But I immediately see a few cases where I need to remove additional text.
Ex: Item #48 - remove “Online Only” from “ADVANCED Ceramide Capsules…”
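One thing to watch for when stripping a label like this: removing only the words leaves behind the space that followed them. A sketch of one way around that, on made-up product names, is to anchor the pattern to the start of the string and include any whitespace after the label. Keeping a flag for the online-only products is an optional extra, not something this project requires.

```r
# Hypothetical product names in the same shape as product_data
products <- c("Online Only Concentrated Eye Cream",
              "Eye Cream",
              "Online Only Travel Size FlashPatch Rejuvenating Eye Gels")

# Remember which products carried the label before removing it
online_only <- grepl("^Online Only", products)

# "^" anchors to the start; "[[:space:]]*" also removes the space after the label
cleaned <- sub("^Online Only[[:space:]]*", "", products)

online_only
cleaned
```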
#Data-Preprocessing: removing '\n', '\t' from product names
product_data<-gsub("\n","",product_data)
product_data<-gsub("\t","",product_data)
#Data-Preprocessing: removing 'Online Only' from product names
product_data<-gsub("Online Only","",product_data)
head(product_data, n = 96)
## [1] "Confidence in an Eye Cream"
## [2] "Advanced Night Repair Eye Concentrate Matrix"
## [3] "All About Eyes"
## [4] "Needles No More No More Baggage"
## [5] "Hydro Cool Firming Eye Gels"
## [6] "Benefiance Wrinkle Smoothing Eye Cream"
## [7] "All About Eyes Rich Eye Cream"
## [8] "Age Smart MultiVitamin Power Firm"
## [9] "Rose Glow Cooling and Brightening Eye Balm"
## [10] "Swimming Under the Eyes Gel Pads"
## [11] "Creamy Eye Treatment with Avocado"
## [12] "Hyaluronic Tripeptide Gel-Cream for Eyes"
## [13] "Glow & Get It Cooling & Brightening Eye Balm"
## [14] "Extreme Firming Eye Cream"
## [15] " Advanced Peptides & Collagen Moisturizer"
## [16] "Repairwear Anti-Gravity Eye Cream"
## [17] "Look Alive Eye Balm"
## [18] "GinZing Refreshing Eye Cream to Brighten and Depuff"
## [19] "Rénergie Lift Multi-Action Lifting And Firming Eye Cream"
## [20] "Midnight Recovery Eye"
## [21] "24K Gold Pure Luxury Lift & Firm Hydra-Gel Eye Patches"
## [22] "+Retinol Firming Eye Cream"
## [23] "Age Smart Reversal Eye Complex"
## [24] "Hydrogel Under-Eye Recovery Patch"
## [25] "Lift & Luminate Triple Action Eye Cream"
## [26] "Anti-Wrinkle Miracle Worker+ Line Correcting Eye Cream"
## [27] "Ageless Genius Firming & Wrinkle Smoothing Eye Cream"
## [28] "Hyaluronic Eye Cream"
## [29] "Cucumber De-Tox Hydra-Gel Eye Patches"
## [30] "Stress Positive Eye Lift"
## [31] "Hydrogel Collagen Under Eye Patch"
## [32] "Clinique Smart Custom-Repair Eye Treatment"
## [33] "Super Restorative Total Eye Concentrate"
## [34] "Maracuja C Brighter Eye Treatment"
## [35] "Moisture Surge Eye 96-Hour Hydro Filler Concentrate"
## [36] "Brightening Eye Serum"
## [37] "Advanced Génifique Yeux Light-Pearl Eye & Lash Concentrate"
## [38] "Eye Cream"
## [39] "All About Eyes Serum De-Puffing Eye Massage"
## [40] "It's Potent Eye Cream"
## [41] "Advanced Night Repair Eye Supercharged Complex Synchronized Recovery"
## [42] "Génifique Yeux Youth Activating Eye Cream"
## [43] " Advanced Peptides & Collagen Eye Cream"
## [44] "Regenerist Retinol 24 Night Eye Cream"
## [45] "Renewing Eye Cream"
## [46] "Auto Correct Eye Cream"
## [47] "Hyperlift Eye Instant Eye Fix For Bags, Lines & Crepiness"
## [48] " ADVANCED Ceramide Capsules Daily Youth Restoring Eye Serum"
## [49] "Restore & Renew Multi Action Eye Cream"
## [50] "Eye Repair Cream"
## [51] "Eye Cream"
## [52] "Rapid Wrinkle Repair Eye Cream"
## [53] "Intensive Eye Repair"
## [54] "Firming DMAE Eye Lift Cream"
## [55] " FlashPatch Rejuvenating Eye Gels"
## [56] "Benefiance WrinkleResist24 Pure Retinol Express Smoothing Eye Mask"
## [57] "360° Tightening Eye Serum"
## [58] "Cold Brew Eye Recovery Stick"
## [59] " FlashPatch Restoring Night Eye Gels"
## [60] "Protect & Perfect Intense Advanced Eye Cream"
## [61] "Youth Dose Eye Treatment"
## [62] "Visionnaire Eye Cream Advanced Multi-Correcting Eye Balm"
## [63] "STEM CELLULAR Anti-Wrinkle Eye Treatment"
## [64] "Total Eye Care SPF15"
## [65] "Intensive Eye Concentrate for Wrinkles"
## [66] "Revive & Rewind Revitalizing Eye Cream"
## [67] "Multi-Spectrum Eye Renewal Serum"
## [68] "Instant FIRMx Eye"
## [69] "Advanced Retinol Eye Cream"
## [70] "Age Reverse Eye Contour"
## [71] "Even Better Eyes Dark Circle Corrector"
## [72] "FAB Skin Lab Retinol Eye Cream with Triple Hyaluronic Acid"
## [73] " Concentrated Eye Cream"
## [74] "Glycolic Eye Cream"
## [75] "Pep Start Eye Cream"
## [76] "Essential Fx Acyl-Glutathione Eyelid Lift Serum"
## [77] "Icelandic Relief Eye Cream with Glacial Flower Extract"
## [78] "Cold Plasma+ Eye"
## [79] "Peptide 4 Eye Recovery Cream"
## [80] "Water Drench Hyaluronic Cloud Hydra-Gel Eye Patches"
## [81] "Hydro Boost Eye Gel-Cream"
## [82] " Travel Size FlashPatch Rejuvenating Eye Gels"
## [83] "Youthful Eye Serum"
## [84] "Multi-Correxion 5-in-1 Eye Cream"
## [85] "Eyes Ultimate Eye Cream"
## [86] "Total Effects Anti-Aging Eye Treatment Eye Transforming Cream"
## [87] " Glow All Out Gift Set"
## [88] "Ultimate Miracle Worker Fix Eye Power-Treatment Fill & Firm"
## [89] "No-Dark-Circle Perfecting Eye Cream"
## [90] "Intensive Eye Concentrate for Wrinkles"
## [91] "SKINLONGEVITY Vital Power Eye Gel Cream"
## [92] "Eye Doctor Moisture Care for Skin Around Eyes"
## [93] "Travel Size Maracuja C-Brighter Eye Treatment"
## [94] "Extra-Firming Eye"
## [95] "Wrinkle Warrior Eye Visible Dark Circle Eraser"
## [96] "Hope In A Tube"
“Online Only” is gone, but now some product names have an extra leading space.
#Data-Preprocessing: removing extra space before the product name
product_data = trimws(product_data)
head(product_data, n = 96)
## [1] "Confidence in an Eye Cream"
## [2] "Advanced Night Repair Eye Concentrate Matrix"
## [3] "All About Eyes"
## [4] "Needles No More No More Baggage"
## [5] "Hydro Cool Firming Eye Gels"
## [6] "Benefiance Wrinkle Smoothing Eye Cream"
## [7] "All About Eyes Rich Eye Cream"
## [8] "Age Smart MultiVitamin Power Firm"
## [9] "Rose Glow Cooling and Brightening Eye Balm"
## [10] "Swimming Under the Eyes Gel Pads"
## [11] "Creamy Eye Treatment with Avocado"
## [12] "Hyaluronic Tripeptide Gel-Cream for Eyes"
## [13] "Glow & Get It Cooling & Brightening Eye Balm"
## [14] "Extreme Firming Eye Cream"
## [15] "Advanced Peptides & Collagen Moisturizer"
## [16] "Repairwear Anti-Gravity Eye Cream"
## [17] "Look Alive Eye Balm"
## [18] "GinZing Refreshing Eye Cream to Brighten and Depuff"
## [19] "Rénergie Lift Multi-Action Lifting And Firming Eye Cream"
## [20] "Midnight Recovery Eye"
## [21] "24K Gold Pure Luxury Lift & Firm Hydra-Gel Eye Patches"
## [22] "+Retinol Firming Eye Cream"
## [23] "Age Smart Reversal Eye Complex"
## [24] "Hydrogel Under-Eye Recovery Patch"
## [25] "Lift & Luminate Triple Action Eye Cream"
## [26] "Anti-Wrinkle Miracle Worker+ Line Correcting Eye Cream"
## [27] "Ageless Genius Firming & Wrinkle Smoothing Eye Cream"
## [28] "Hyaluronic Eye Cream"
## [29] "Cucumber De-Tox Hydra-Gel Eye Patches"
## [30] "Stress Positive Eye Lift"
## [31] "Hydrogel Collagen Under Eye Patch"
## [32] "Clinique Smart Custom-Repair Eye Treatment"
## [33] "Super Restorative Total Eye Concentrate"
## [34] "Maracuja C Brighter Eye Treatment"
## [35] "Moisture Surge Eye 96-Hour Hydro Filler Concentrate"
## [36] "Brightening Eye Serum"
## [37] "Advanced Génifique Yeux Light-Pearl Eye & Lash Concentrate"
## [38] "Eye Cream"
## [39] "All About Eyes Serum De-Puffing Eye Massage"
## [40] "It's Potent Eye Cream"
## [41] "Advanced Night Repair Eye Supercharged Complex Synchronized Recovery"
## [42] "Génifique Yeux Youth Activating Eye Cream"
## [43] "Advanced Peptides & Collagen Eye Cream"
## [44] "Regenerist Retinol 24 Night Eye Cream"
## [45] "Renewing Eye Cream"
## [46] "Auto Correct Eye Cream"
## [47] "Hyperlift Eye Instant Eye Fix For Bags, Lines & Crepiness"
## [48] "ADVANCED Ceramide Capsules Daily Youth Restoring Eye Serum"
## [49] "Restore & Renew Multi Action Eye Cream"
## [50] "Eye Repair Cream"
## [51] "Eye Cream"
## [52] "Rapid Wrinkle Repair Eye Cream"
## [53] "Intensive Eye Repair"
## [54] "Firming DMAE Eye Lift Cream"
## [55] "FlashPatch Rejuvenating Eye Gels"
## [56] "Benefiance WrinkleResist24 Pure Retinol Express Smoothing Eye Mask"
## [57] "360° Tightening Eye Serum"
## [58] "Cold Brew Eye Recovery Stick"
## [59] "FlashPatch Restoring Night Eye Gels"
## [60] "Protect & Perfect Intense Advanced Eye Cream"
## [61] "Youth Dose Eye Treatment"
## [62] "Visionnaire Eye Cream Advanced Multi-Correcting Eye Balm"
## [63] "STEM CELLULAR Anti-Wrinkle Eye Treatment"
## [64] "Total Eye Care SPF15"
## [65] "Intensive Eye Concentrate for Wrinkles"
## [66] "Revive & Rewind Revitalizing Eye Cream"
## [67] "Multi-Spectrum Eye Renewal Serum"
## [68] "Instant FIRMx Eye"
## [69] "Advanced Retinol Eye Cream"
## [70] "Age Reverse Eye Contour"
## [71] "Even Better Eyes Dark Circle Corrector"
## [72] "FAB Skin Lab Retinol Eye Cream with Triple Hyaluronic Acid"
## [73] "Concentrated Eye Cream"
## [74] "Glycolic Eye Cream"
## [75] "Pep Start Eye Cream"
## [76] "Essential Fx Acyl-Glutathione Eyelid Lift Serum"
## [77] "Icelandic Relief Eye Cream with Glacial Flower Extract"
## [78] "Cold Plasma+ Eye"
## [79] "Peptide 4 Eye Recovery Cream"
## [80] "Water Drench Hyaluronic Cloud Hydra-Gel Eye Patches"
## [81] "Hydro Boost Eye Gel-Cream"
## [82] "Travel Size FlashPatch Rejuvenating Eye Gels"
## [83] "Youthful Eye Serum"
## [84] "Multi-Correxion 5-in-1 Eye Cream"
## [85] "Eyes Ultimate Eye Cream"
## [86] "Total Effects Anti-Aging Eye Treatment Eye Transforming Cream"
## [87] "Glow All Out Gift Set"
## [88] "Ultimate Miracle Worker Fix Eye Power-Treatment Fill & Firm"
## [89] "No-Dark-Circle Perfecting Eye Cream"
## [90] "Intensive Eye Concentrate for Wrinkles"
## [91] "SKINLONGEVITY Vital Power Eye Gel Cream"
## [92] "Eye Doctor Moisture Care for Skin Around Eyes"
## [93] "Travel Size Maracuja C-Brighter Eye Treatment"
## [94] "Extra-Firming Eye"
## [95] "Wrinkle Warrior Eye Visible Dark Circle Eraser"
## [96] "Hope In A Tube"
And it worked!
#Using CSS selectors to scrape the number of reviews
review_html <- html_nodes(webpage,'.prodCellReview')
#Converting the review data to text
review_data <- html_text(review_html)
#Let's have a look at the review data
head(review_data, n=5)
## [1] " \n\t\t\t\t\t\t\t\t\t (577 reviews) \n\t\t\t\t\t\t\t"
## [2] " \n\t\t\t\t\t\t\t\t\t (32 reviews) \n\t\t\t\t\t\t\t"
## [3] " \n\t\t\t\t\t\t\t\t\t (1498 reviews) \n\t\t\t\t\t\t\t"
## [4] " \n\t\t\t\t\t\t\t\t\t (157 reviews) \n\t\t\t\t\t\t\t"
## [5] " \n\t\t\t\t\t\t\t\t\t (847 reviews) \n\t\t\t\t\t\t\t"
#Data-Preprocessing: removing '\n', '\t' from number of reviews
review_data<-gsub("\n","",review_data)
review_data<-gsub("\t","",review_data)
review_data<-gsub("\\(","",review_data) # "(" is a regex metacharacter: escape it with \\ or enclose it in square brackets, e.g. [(]
review_data<-gsub("reviews\\)","",review_data) # escape the closing parenthesis as well
head(review_data, n = 10)
## [1] " 577 " " 32 " " 1498 " " 157 " " 847 " " 253 "
## [7] " 203 " " 607 " " 7 " " 172 "
#Data-Preprocessing: removing extra space before and after the numbers (of reviews)
review_data = trimws(review_data)
head(review_data, n = 10)
## [1] "577" "32" "1498" "157" "847" "253" "203" "607" "7" "172"
#Data-Preprocessing: converting review_data to numerical
review_data<-as.numeric(review_data)
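The four gsub() calls and trimws() above can also be collapsed into a single step by deleting every character that is not a digit. This is a minimal sketch, assuming each string contains exactly one number; `raw_reviews` is sample input in the same shape as the scraped output, and `clean_reviews` is a hypothetical helper name, not part of rvest.

```r
# Sample strings in the same shape as the scraped output above
raw_reviews <- c(
  " \n\t\t\t (577 reviews) \n\t\t",
  " \n\t\t\t (32 reviews) \n\t\t",
  " \n\t\t\t (1498 reviews) \n\t\t"
)

# Hypothetical helper: drop everything that is not a digit, then convert
clean_reviews <- function(x) {
  as.numeric(gsub("[^0-9]", "", x))
}

review_counts <- clean_reviews(raw_reviews)
print(review_counts)
```

One thing to watch for: a string with no digits at all would come back as NA, so it is worth checking for NA values before analysis.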
Finally, it’s time to put the brand, product name, price, and reviews data into one data frame.
The data frame will contain 4 variables: two categorical (Brand Name, Product Name) and two numerical (Price, Number of Reviews).
It has 96 observations.
#Combining all the vectors to form a data frame
ulta_eye_df<-data.frame(Brand = brand_data, Product_Name = product_data, Price = price_data, Reviews = review_data)
str(ulta_eye_df)
## 'data.frame': 96 obs. of 4 variables:
## $ Brand : Factor w/ 42 levels "Ahava","BareMinerals",..: 18 14 6 10 37 36 6 8 42 17 ...
## $ Product_Name: Factor w/ 94 levels "+Retinol Firming Eye Cream",..: 27 6 15 67 51 20 16 12 79 84 ...
## $ Price : num 38 69 33 42 32 70 33 65 30 34 ...
## $ Reviews : num 577 32 1498 157 847 ...
YES!! Whew! Let’s look at this data frame….
DRUMROLL… And we have an absolutely, perfectly clean dataset of Ulta eye creams (page 1 of 4)!
At a future date, I will repeat this process for all the search results pages (there are 3 more pages of eye treatments) and join them together into one large data frame.
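A rough sketch of how that joining step might look is below. The function `scrape_page()` is a hypothetical stand-in that returns placeholder rows; a real version would repeat the read_html(), html_nodes(), and html_text() steps for each results page, and the URL pattern for pages 2 to 4 is not covered here.

```r
# Hypothetical stand-in for the scraping steps above; a real version would
# fetch and parse one search results page and return its cleaned data frame
scrape_page <- function(page_number) {
  data.frame(
    Page = page_number,
    Product_Name = paste("Placeholder product", page_number),
    Price = NA_real_,
    stringsAsFactors = FALSE
  )
}

# Scrape each page into its own data frame, then stack them row-wise
page_list <- lapply(1:4, scrape_page)
all_pages_df <- do.call(rbind, page_list)

str(all_pages_df)
```

Keeping a Page column makes it easy to trace any odd row back to the page it came from.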
But first, let’s make sure this data can be written into a csv file and downloaded.
Here is the general pattern for writing to CSV:
write.csv(MyData, file = "MyData.csv")
This writes the data frame MyData to a new file called MyData.csv in your working directory.
# Write the scraped Ulta eye treatment data to a csv file
write.csv(ulta_eye_df, file = "Ulta_Eye_Treatments.csv")
Go check - the CSV file was successfully created in my working directory.
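By default, write.csv() also writes the row names as an unnamed first column. If that column is not wanted, `row.names = FALSE` leaves it out. The sketch below uses the first three products from the output above and a temporary file (an assumption made so the example does not touch the working directory), then reads the file back as a quick check.

```r
# First three products, prices, and review counts from the output above
sample_df <- data.frame(
  Product_Name = c("Confidence in an Eye Cream",
                   "Advanced Night Repair Eye Concentrate Matrix",
                   "All About Eyes"),
  Price = c(38, 69, 33),
  Reviews = c(577, 32, 1498),
  stringsAsFactors = FALSE
)

# Write to a temporary file, leaving out the row-name column
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_df, file = csv_path, row.names = FALSE)

# Read the file back and compare it with the original data frame
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)
print(all.equal(sample_df, round_trip))
```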
Let’s have a general look at the prices.
# Calculate summary statistics
summary(ulta_eye_df$Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 27.88 41.00 45.46 62.50 122.00
sd(ulta_eye_df$Price)
## [1] 22.4037
The mean price of an eye treatment (on Page 1) is about $45 with a standard deviation of $22. The lowest price is $10 (Earth Therapeutics) and the highest price is $122 (Perricone MD).
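To see which rows sit at the extremes, which.min() and which.max() return the position of the lowest and highest value. The sketch below uses only the first ten products and prices shown in the output above, so its cheapest and priciest rows are not the ones for the full page.

```r
# First ten products and prices from the output above
first_ten <- data.frame(
  Product_Name = c("Confidence in an Eye Cream",
                   "Advanced Night Repair Eye Concentrate Matrix",
                   "All About Eyes",
                   "Needles No More No More Baggage",
                   "Hydro Cool Firming Eye Gels",
                   "Benefiance Wrinkle Smoothing Eye Cream",
                   "All About Eyes Rich Eye Cream",
                   "Age Smart MultiVitamin Power Firm",
                   "Rose Glow Cooling and Brightening Eye Balm",
                   "Swimming Under the Eyes Gel Pads"),
  Price = c(38, 69, 33, 42, 32, 70, 33, 65, 30, 34),
  stringsAsFactors = FALSE
)

# Pull the full row for the lowest and highest price
cheapest <- first_ten[which.min(first_ten$Price), ]
priciest <- first_ten[which.max(first_ten$Price), ]

print(cheapest)
print(priciest)
```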
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
price_ranking <- ulta_eye_df %>%
arrange(desc(Price))
# Plot a histogram of the frequency of prices
hist(price_ranking$Price, main = "Distribution of Ulta Eye Treatments By Price ($ USD)", xlab = "Price", breaks = 20)
# Draw a red line representing the mean price
abline(v=mean(price_ranking$Price), col="red", lwd=3)
Increasing the number of breaks (breaks = 20 instead of R's default, which is chosen by Sturges' rule) reveals a bimodal distribution: a peak for lower-priced treatments (low $20s) and a peak for higher-priced treatments (around $65). This is consistent with my observations as a skincare shopper.
Let’s manipulate the data now to get a deeper look at prices.
# Load dplyr to manipulate the data and ggplot2 to make some plots
library(dplyr)
library(ggplot2)
# Count the # of occurrences of the brand names
ulta_eye_df %>% count(Brand, sort = TRUE)
## # A tibble: 42 x 2
## Brand n
## <fct> <int>
## 1 Clinique 8
## 2 StriVectin 6
## 3 Dermalogica 5
## 4 Derma E 4
## 5 Lancôme 4
## 6 No7 4
## 7 Peter Thomas Roth 4
## 8 Tula 4
## 9 Kiehl's Since 1851 3
## 10 Olay 3
## # … with 32 more rows
The brand with the most eye treatments is Clinique, followed by StriVectin and Dermalogica. Five brands are tied at four products each: Derma E, Lancôme, No7, Peter Thomas Roth, and Tula.
Hmm interesting, did not expect to see Derma E up there. Is this list correct? Did I use the count function correctly? Let me do a visual check of the data.
Wow, Derma E does have 4 eye treatments. Derma E is not a big brand, which is why I am surprised to see it have so many SKUs in the eye category.
Let’s get a closer look at the pricing for each brand.
# Create a tibble of the brand names with their average price, minimum price, and maximum price
brand_summary <- ulta_eye_df %>%
group_by(Brand) %>%
summarise(n = n(), averagePrice = mean(Price), minPrice = min(Price), maxPrice = max(Price))
From this tibble, we can quickly see that the Top 5 most expensive brands for eye treatments are: Perricone MD, Murad, Exuviance, Clarins, and Kate Somerville.
A surprise to see Kate Somerville in 5th. I expected this brand to be higher than Murad, Exuviance, and certainly Clarins.
# Set the theme
theme_set(theme_bw())
# Create a bar plot of each brand's average eye treatment price
ggplot(brand_summary, aes(x=Brand, y=averagePrice)) +
geom_bar(stat="identity", width=.5, fill="tomato3") +
labs(title="Average Eye Treatment Prices by Brand at Ulta Beauty (Nov 2019)",
subtitle="This is just Page 1 of 4 for Eye Treatments on Ulta.com",
caption="source: https://www.ulta.com/skin-care-eye-treatments?N=270k") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))
Hmm, the brands appear in ALPHABETICAL order. I was expecting the order to be the same order (high to low) as the brand_summary tibble.
Let’s try adding in the arrange function.
# Now arrange the brands by average price (descending order)
brand_ordered_chart <- brand_summary %>%
arrange(desc(averagePrice)) %>%
ggplot(aes(x=Brand, y=averagePrice)) +
geom_bar(stat="identity", width=.5, fill="tomato3") +
labs(title="Average Eye Treatment Prices by Brand at Ulta Beauty (Nov 2019)",
subtitle="This is just Page 1 of 4 for Eye Treatments on Ulta.com",
caption="source: https://www.ulta.com/skin-care-eye-treatments?N=270k") +
theme(axis.text.x = element_text(angle=65, vjust=0.5))
brand_ordered_chart
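No change in the plot, and no luck after tinkering with the code. The likely reason is that ggplot2 orders a discrete axis by the factor levels of Brand (alphabetical by default), not by the row order that arrange() sets.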
Well, alphabetical order is helpful if you are interested in a particular brand and want to locate it quickly on the chart.
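For anyone who does want the bars sorted by price, one common approach is to reorder the factor levels rather than the rows. The sketch below uses reorder() from base R on a toy data frame; the brand names and average prices are made up purely to show the ordering behavior and are not from the scraped data.

```r
library(ggplot2)

# Made-up brands and average prices, only to show the ordering behavior
toy_summary <- data.frame(
  Brand = factor(c("Brand A", "Brand B", "Brand C")),
  averagePrice = c(20, 65, 40)
)

# reorder() rebuilds the factor levels by averagePrice; the minus sign
# puts the highest average first
toy_summary$Brand <- reorder(toy_summary$Brand, -toy_summary$averagePrice)
print(levels(toy_summary$Brand))

# The x axis now follows the new level order instead of the alphabet
ordered_chart <- ggplot(toy_summary, aes(x = Brand, y = averagePrice)) +
  geom_bar(stat = "identity", width = .5, fill = "tomato3") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.5))
ordered_chart
```

Applied to the real data, the same idea would be to write `aes(x = reorder(Brand, -averagePrice), y = averagePrice)` inside the ggplot() call.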
This is a dense graphic, hard to read when there are so many brands. But it gives a good bird’s eye view of peaks and lows.
Is there a relationship between a Product’s Price and the Number of Reviews it gets?
For instance, does a more expensive product get fewer reviews (because fewer people can afford to buy the higher priced product)?
# The following custom theme came from this site: https://www.datanovia.com/en/blog/ggplot-themes-gallery/
# I modified the color scheme. It was originally light blue. I changed it to orange (Ulta's brand color).
theme_orangewhite <- function (base_size = 11, base_family = "") {
theme_bw(base_size = base_size, base_family = base_family) %+replace%
theme(
panel.grid.major = element_line(color = "white"),
panel.background = element_rect(fill = "orange"),
panel.border = element_rect(color = "orange", fill = NA),
axis.line = element_line(color = "orange"),
axis.ticks = element_line(color = "orange"),
axis.text = element_text(color = "brown")
)
}
# Create a scatterplot of Price vs Number of Reviews
p1 <- ggplot(price_ranking, aes(x = Price, y = Reviews)) +
ggtitle("Ulta Eye Treatment Price vs Number of Reviews (Nov 2019)") +
xlab("Price of Eye Treatment ($ USD)") +
ylab("Number of Reviews for the Eye Treatment") +
theme_orangewhite()
p1 + geom_point()
There does not appear to be a relationship between price and number of reviews.
The extreme outlier is Olay Eyes Ultimate Eye Cream, with 5,477 reviews. Wow, that’s a lot for any site.
Re-plot without this outlier to un-squish the other points. Add a linear regression line with confidence interval.
# Remove the outlier (Olay) from the plot using ylim().
# Add a smoother to the scatterplot
p2 <- ggplot(price_ranking, aes(x = Price, y = Reviews)) +
ggtitle("Ulta Eye Treatment Price vs Number of Reviews (Nov 2019)") +
ylim(0,1500) +
xlab("Price of Eye Treatment ($ USD)") +
ylab("Number of Reviews for the Eye Treatment") +
theme_orangewhite()
p2 + geom_point() + geom_smooth(method='lm', formula = y ~ x)
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Calculate the correlation coefficient.
# Look at the correlation coefficient
cor(price_ranking$Price,price_ranking$Reviews)
## [1] -0.1044596
There is a very weak negative correlation between price and number of reviews.
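Note that ylim() only removed the outlier from the plot; cor() above still uses every row. A quick way to see how much one extreme point matters is to compare the coefficient with and without it, and to look at a rank-based measure as well. The sketch below uses the first ten Price and Reviews pairs from the output above plus one placeholder row for the outlier: the 5,477 review count is from the text, but the price of 30 for that row is an assumption.

```r
# First ten Price/Reviews pairs from the output above, plus one placeholder
# row for the outlier (5477 reviews is from the text; the price is assumed)
toy_df <- data.frame(
  Price   = c(38, 69, 33, 42, 32, 70, 33, 65, 30, 34, 30),
  Reviews = c(577, 32, 1498, 157, 847, 253, 203, 607, 7, 172, 5477)
)

# Pearson correlation on every row, outlier included
r_all <- cor(toy_df$Price, toy_df$Reviews)

# Same calculation after dropping rows above the plot's y-axis limit
no_outlier <- subset(toy_df, Reviews <= 1500)
r_trimmed <- cor(no_outlier$Price, no_outlier$Reviews)

# Spearman's rank correlation is less sensitive to extreme values
rho_all <- cor(toy_df$Price, toy_df$Reviews, method = "spearman")

print(c(pearson_all = r_all, pearson_trimmed = r_trimmed, spearman_all = rho_all))
```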
# Look at the linear model
fit <- lm(Reviews ~ Price, data = price_ranking)
summary(fit)
##
## Call:
## lm(formula = Reviews ~ Price, data = price_ranking)
##
## Residuals:
## Min 1Q Median 3Q Max
## -328.2 -203.1 -131.7 21.9 5177.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 382.810 137.730 2.779 0.00658 **
## Price -2.770 2.721 -1.018 0.31113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 594.1 on 94 degrees of freedom
## Multiple R-squared: 0.01091, Adjusted R-squared: 0.0003896
## F-statistic: 1.037 on 1 and 94 DF, p-value: 0.3111
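The p-value for the Price coefficient is high (0.31) and the R-squared is only about 0.01, so this model gives no evidence of a linear relationship between price and number of reviews.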
The scatter plot allows you to eyeball the points quickly and see that Ulta’s eye treatments fall mostly in the range of $25-$65. Looking back on the summary statistics, this checks out: Q1 is about $28 and Q3 is $62.50.
If I were to re-create the same plot for Sephora’s eye treatments, I would expect to see the cloud of points shifted to the right by at least $10-15. I will look forward to checking my guess later.
To be honest, I don’t believe that the number of reviews a beauty product receives is correlated with its price. But I tested this “hypothesis” for this project because that is what the data allowed me to do. Given that my ability to scrape Ulta’s website could yield only two quantitative variables (price and number of reviews), that is what I chose to analyze.
The categorical variables (brand name and product name) could yield some insights if they were classified into new categories, such as the pricing tier the product belongs to (mass/drugstore, indie, prestige, luxury, spa channel, etc.). That was beyond the scope of this project. It would take some time to create a list of brands for each tier, and then write code that could search and compare from this master list.
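As a rough idea of what that lookup could look like, the sketch below matches each product's brand against a small master list. The brand names appear in the count output earlier, but the tier assignments are placeholders invented for illustration, and `tier_lookup` is a hypothetical table that would need to be built by hand.

```r
# Brands taken from the count() output above
products <- data.frame(
  Brand = c("Clinique", "Olay", "No7", "Derma E", "Clinique"),
  stringsAsFactors = FALSE
)

# Hypothetical master list; these tier assignments are placeholders
tier_lookup <- data.frame(
  Brand = c("Clinique", "Olay", "No7"),
  Tier  = c("prestige", "mass/drugstore", "mass/drugstore"),
  stringsAsFactors = FALSE
)

# match() finds each product's brand in the lookup; unlisted brands get NA
products$Tier <- tier_lookup$Tier[match(products$Brand, tier_lookup$Brand)]
print(products)
```

Brands missing from the master list show up as NA, which makes it easy to spot which ones still need a tier.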
That said though, this exercise is consistent with research on what drives people to write reviews. According to TrustPilot, a Danish-founded web company that hosts a review platform, the top 3 reasons people write reviews are: “to help others make a better buying decision, to share an experience, or to reward a company for good performance” (Footnote 6).
I was not able to find any research on how the price of a product factors into the decision to review a product. My guess is that it might play a role, even if a small one, particularly for very large purchases such as a car or major appliance, where the stakes are higher when a purchase goes wrong.
The more interesting question is what the number of ratings reveals about a company. Does Ulta have more reviews than Sephora? Or vice versa? More reviews generally means a site has more customers (if you exclude confounding variables like fake reviews), has more engaged customers or customers incented to leave a review, or has been collecting reviews for a longer time. Comparing the number of reviews across beauty retailers and looking at how they garner those reviews would be a good project for a marketing manager.
FOOTNOTES (BIBLIOGRAPHY):
“Why Ulta Beauty Is Winning Customers And Keeps Growing Rapidly,” Walter Loeb, Forbes, March 18, 2019. https://www.forbes.com/sites/walterloeb/2019/03/18/why-ulta-beauty-wins-customers-and-keeps-growing-rapidly/#6a703cc664a6
“Sephora, Ulta And The Battle For The $56B U.S. Beauty Retail Market,” Pamela Danziger, Forbes, August 6, 2018. https://www.forbes.com/sites/pamdanziger/2018/08/06/sephora-and-ulta-are-on-a-collision-course-then-there-is-amazon-where-is-us-beauty-retail-headed/#138839e255dd
“LVMH’s Louis Vuitton and Sephora Brands Are Locking In Teen Shoppers,” Leo Sun, The Motley Fool, April 11, 2019. https://www.fool.com/investing/2019/04/11/lvmhs-louis-vuitton-and-sephora-brands-are-locking.aspx
“We shopped at Sephora and Ulta to see which was a better beauty store — and the winner was clear,” Jessica Tyler, Business Insider, December 18, 2018. https://www.businessinsider.com/sephora-vs-ulta-sales-compared-photos-details-2018-4
“Company Overview,” Ulta.com. http://ir.ultabeauty.com/company-information/company-overview/default.aspx#:~:targetText=Our%20stores%20and%20website%20offer,label%2C%20the%20Ulta%20Beauty%20Collection.
“Why Do People Write Reviews?”, March 7, 2018, TrustPilot. https://business.trustpilot.com/reviews/why-do-people-write-reviews-what-our-research-revealed#:~:targetText=According%20to%20another%20recent%20consumer,both%20men%20and%20women%20internationally.