I run a small pottery studio out of my home as a side business to my day job at UMaine. I mainly sell to shops and at craft fairs throughout New England, but I have occasionally sold a few pieces on Etsy too. With COVID-19 canceling all my fairs this year, I have considered ramping up my Etsy presence. While researching the site, I was struck by the “Etsy Design Award Winners” page and was curious about the items featured. This report looks at the award-winning items by price, shop reviews, and whether they ship for free, to see if anything interesting stands out.
The data for this project came from scraping the Etsy page featuring the design award winners. No intellectual property was stolen in this process, and I kept my scraping session as short as possible so as not to degrade the site's performance. Because web scrapers are brittle and it seemed fairly likely that Etsy would change or remove the page during the course of the project, I archived the page here as a backup.
library(rvest)
library(stringr)
library(reshape2)
library(tidyverse)
library(rsconnect)
library(knitr)
etsy_award_winners <- read_html("https://www.etsy.com/featured/etsydesignawards?ref=finds_c")
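One way to keep a scraping session short and to work from an archived copy is to read from a local snapshot whenever one exists. The sketch below is only an illustration of that idea: `read_cached_html()` is a hypothetical helper, not part of rvest, and the URL and file path are placeholders. The demonstration writes a tiny stand-in snapshot to a temporary file so that it runs without touching the network.

```r
library(rvest)

# Hypothetical helper: use the local snapshot if present, otherwise download it once
read_cached_html <- function(url, path) {
  if (!file.exists(path)) {
    # Only hit the live site when no snapshot has been saved yet
    download.file(url, destfile = path, quiet = TRUE)
  }
  read_html(path)
}

# Offline demonstration: pretend a snapshot was saved on an earlier run
snapshot <- tempfile(fileext = ".html")
writeLines("<html><body><p class='shop'>ExampleShop</p></body></html>", snapshot)

# The placeholder URL is never requested because the snapshot already exists
page <- read_cached_html("https://example.com/not-fetched", snapshot)
shop <- html_text(html_node(page, "p.shop"))
```

Every later run then parses the saved file, so the analysis stays reproducible even if the live page changes.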
Below is a table of the 25 most expensive items from the Etsy Design Award winners list. I was not surprised that some shops made the list multiple times with different items, but I was slightly shocked to see extremely expensive, award-winning items from shops with so few (or in some cases no) reviews. This suggests that items were chosen solely on the visual impression of the work, without taking into account the shop’s broader contribution to the platform, though there are certainly exceptions to this theory.
# Shop name for each listing
shop_names <- etsy_award_winners %>% html_nodes("p.display-inline-block") %>% html_text()
# Shop review counts; strip the surrounding parentheses
total_shop_reviews <- etsy_award_winners %>% html_nodes(".icon-b-1") %>% html_text()
total_shop_reviews1 <- total_shop_reviews %>% str_replace_all("\\(", "") %>% str_replace_all("\\)", "")
# Pad with a blank entry at index 100, then reorder by hand so each count lines up with its shop
total_shop_reviews1[100] <-""
total_shop_reviews2 <- total_shop_reviews1[c(1:17,25,18:22,78,23:24,26:64,100,65:77,79:99)]
# Price text and free-shipping badges
free_shipping_price <- etsy_award_winners %>% html_nodes(".text-body-larger , .wt-badge--sale-01") %>% html_text()
# Remove whitespace and mark the free-shipping text with "&" so it can be split off
free_shipping_price1 <- free_shipping_price %>% str_replace_all(" ", "") %>% str_replace_all("\n", "") %>% str_replace_all("FREEshipping", "&FREE shipping")
# Drop entries that contain only the badge text, then split into price and shipping columns
free_shipping_price2 <- free_shipping_price1[free_shipping_price1 != "&FREE shipping"]
free_shipping_price3 <- colsplit(free_shipping_price2, "\\&", names = c("price", "shipping"))
award_winners <- cbind(shop_names, total_shop_reviews2, free_shipping_price3)
# Drop five rows by position
award_winners1 <- award_winners[-c(23,35,39,45,67),]
# Convert review counts and prices to numbers (blank review counts become NA)
award_winners1$total_shop_reviews2 <- as.numeric(gsub(",", "",award_winners1$total_shop_reviews2))
award_winners1$price <- as.numeric(gsub("[\\, $]","", award_winners1$price))
# Label listings without free shipping as "CFR"
award_winners1$shipping[award_winners1$shipping ==""] <- "CFR"
# Sort by price and show the 25 most expensive items
award_winners1<- award_winners1 %>% arrange(desc(price))
kable(head(award_winners1,25), padding = 10)
| shop_names | total_shop_reviews2 | price | shipping |
|---|---|---|---|
| MineralogyDesign | 538 | 8000.00 | CFR |
| artemer | 832 | 5430.00 | FREE shipping |
| wrenandcooper | NA | 5350.00 | CFR |
| AdrianMartinus | 653 | 4490.00 | FREE shipping |
| AdrianMartinus | 653 | 4490.00 | FREE shipping |
| DreamersandLovers | 177 | 1955.00 | FREE shipping |
| AdrianMartinus | 653 | 1840.00 | FREE shipping |
| SumarokovaAtelier | 201 | 1489.00 | FREE shipping |
| DemiMacrameDesigns | 9 | 1300.00 | FREE shipping |
| DemiMacrameDesigns | 15 | 1300.00 | FREE shipping |
| PatienceAndGough | 5 | 1010.81 | CFR |
| WardrobeByDulcinea | 676 | 998.00 | FREE shipping |
| WardrobeByDulcinea | 676 | 998.00 | FREE shipping |
| sibodesigns | 548 | 564.00 | FREE shipping |
| DodoLeather | 923 | 550.00 | FREE shipping |
| LABBVENN | 6 | 505.00 | CFR |
| ATUKO | 62 | 494.00 | FREE shipping |
| TeslerMendelovitch | 476 | 480.00 | CFR |
| TeslerMendelovitch | NA | 480.00 | CFR |
| maisolorzano | 103 | 476.35 | FREE shipping |
| maisolorzano | 29 | 476.35 | FREE shipping |
| FashionforFables | 236 | 316.30 | CFR |
| FashionforFables | 236 | 316.30 | CFR |
| PaniJurek | 81 | 305.00 | CFR |
| PaniJurek | 81 | 305.00 | CFR |
Below is a plot comparing free shipping against CFR (my label for listings without free shipping) by the number of reviews the corresponding shops have. In my experience, every customer to whom I have refunded part of a shipping charge (because the item cost less to ship than they paid) has left me a positive review. However, none of them mentioned the refund in the review itself. I suspect that the amount a person pays for shipping has both a conscious and an unconscious effect on whether they are vocally happy with the item they purchased.
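ggplot( data= award_winners1, aes(x= shipping, y= total_shop_reviews2)) +
geom_boxplot() +
# Zoom the axis without dropping shops above 2,000 reviews from the box statistics
coord_cartesian(ylim = c(0, 2000))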
It appears that items that shipped free came from shops with more reviews than items that shipped CFR. However, there was a lot more variance within the free shipping data. To dig into this further, I also wanted to account for the price of the items. Below is a plot of an item's price against the number of reviews held by the shop selling it. Overlaid are two lines. The blue line is the fitted relationship between price and reviews for items that shipped for free, and the red line is the fitted relationship for items that shipped CFR.
summary(lm(total_shop_reviews2 ~ shipping*price, data = award_winners1))
##
## Call:
## lm(formula = total_shop_reviews2 ~ shipping * price, data = award_winners1)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -761.5 -678.1 -480.9  -12.4 9801.6
##
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 755.77479  275.30378   2.745  0.00752 **
## shippingFREE shipping        28.82577  377.53888   0.076  0.93934
## price                        -0.02751    0.19731  -0.139  0.88946
## shippingFREE shipping:price  -0.03729    0.27404  -0.136  0.89211
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1542 on 77 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.001759, Adjusted R-squared: -0.03713
## F-statistic: 0.04523 on 3 and 77 DF, p-value: 0.9871
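# Refit the model so the overlaid lines match the coefficients in the summary above
fit <- coef(lm(total_shop_reviews2 ~ shipping*price, data = award_winners1))
ggplot( data= award_winners1, aes(x= price, y= total_shop_reviews2, color =shipping)) +
geom_point(alpha = .5) +
# CFR is the reference level: intercept and price slope as estimated
geom_abline(slope= fit[["price"]], intercept = fit[["(Intercept)"]], color = "tomato") +
# Free shipping adds its own offset to the intercept and the interaction term to the slope
geom_abline(slope = fit[["price"]] + fit[["shippingFREE shipping:price"]], intercept = fit[["(Intercept)"]] + fit[["shippingFREE shipping"]], color = "cyan3") +
xlim(0, 2000) +
ylim(0, 2600)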
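For reference, the two fitted lines implied by the model summary above can be worked out directly from its coefficients. With an interaction model, the CFR line uses the intercept and price slope as reported, while the free shipping line adds the shipping offset to the intercept and the interaction term to the slope. The sketch below copies the estimates from the printed output; the helper `predict_reviews()` and the $500 example price are only illustrative.

```r
# Coefficients copied from the summary() output above
b <- c(intercept = 755.77479, free = 28.82577, price = -0.02751, free_price = -0.03729)

# CFR is the reference level, so its line uses the intercept and price slope directly
cfr_line <- c(intercept = b[["intercept"]], slope = b[["price"]])

# Free shipping shifts the intercept and adds the interaction term to the slope
free_line <- c(
  intercept = b[["intercept"]] + b[["free"]],
  slope     = b[["price"]] + b[["free_price"]]
)

# Illustrative helper: predicted shop reviews at a given price
predict_reviews <- function(line, price) line[["intercept"]] + line[["slope"]] * price

cfr_at_500  <- predict_reviews(cfr_line, 500)
free_at_500 <- predict_reviews(free_line, 500)
```

Both slopes are close to zero, which is consistent with the large p-values in the summary.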
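In this sample, items that shipped for free came from shops with slightly more reviews. Price and shipping together seemed to matter more in the plot, but the regression does not back this up: none of its coefficients is statistically significant, and the model explains almost none of the variation (R-squared of 0.0018). Unfortunately, with only 95 observations, 14 of which were dropped from the model for missing review counts and a handful of which were outliers, there is a lot of variation in this data. A larger sample would, perhaps, help draw a more substantive conclusion. Taking the log of price and review counts might also give more stable estimates, but that seems slightly outside the scope of this project.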
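As a rough idea of what the log approach could look like, the sketch below refits the same interaction model on log scales. It uses only a handful of rows copied from the table of most expensive items above (rows with missing review counts and repeated items are left out), so it shows the mechanics rather than a result. Using `log1p()` for reviews is my assumption, chosen so that a shop with zero reviews would not produce `-Inf`.

```r
# A handful of rows copied from the table of most expensive items above
sample_rows <- data.frame(
  reviews  = c(538, 832, 653, 177, 201, 9, 5, 676, 548, 923, 6, 62, 476, 103, 236, 81),
  price    = c(8000, 5430, 4490, 1955, 1489, 1300, 1010.81, 998,
               564, 550, 505, 494, 480, 476.35, 316.30, 305),
  shipping = c("CFR", "FREE shipping", "FREE shipping", "FREE shipping",
               "FREE shipping", "FREE shipping", "CFR", "FREE shipping",
               "FREE shipping", "FREE shipping", "CFR", "FREE shipping",
               "CFR", "FREE shipping", "CFR", "CFR")
)

# Same model as before, with reviews and price on log scales
# log1p(x) is log(1 + x), which stays finite when a shop has zero reviews
log_fit <- lm(log1p(reviews) ~ shipping * log(price), data = sample_rows)
log_coefs <- coef(log_fit)
```

On log scales, a single shop with thousands of reviews has much less pull on the fitted lines than it does in the untransformed model.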
I definitely struggled with the data cleaning for this assignment. Some shop review counts came through out of order, and others were missing outright. After much trial and error, I adjusted them by hand, which wouldn’t be sustainable in a larger data set. Stray HTML tags also came through in odd places and were hard to fix, so overall it was a bit of a slog. Web scraping is certainly valuable and saves quite a bit of time, but it’s not without its own pitfalls and frustrations.
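One way to avoid the hand reordering described above is to scrape each listing as a unit rather than pulling each field as a separate page-wide vector. The sketch below uses a toy HTML snippet with made-up class names (`.card`, `.shop`, `.reviews`), since I have not checked what the real listing container on the Etsy page is called. The key point is that `html_node()` (singular) returns exactly one result per listing, so a listing with no review count becomes `NA` instead of shifting every later value up a row.

```r
library(rvest)

# Toy stand-in for the scraped page; the class names are invented for illustration
page <- read_html('
  <div class="card"><p class="shop">ShopA</p><span class="reviews">(1,204)</span></div>
  <div class="card"><p class="shop">ShopB</p></div>
  <div class="card"><p class="shop">ShopC</p><span class="reviews">(15)</span></div>
')

# One node per listing
cards <- html_nodes(page, ".card")

# html_node() keeps one entry per card, with NA where the field is absent
listings <- data.frame(
  shop        = html_text(html_node(cards, ".shop")),
  reviews_raw = html_text(html_node(cards, ".reviews")),
  stringsAsFactors = FALSE
)

# Strip parentheses and thousands separators, then convert to numbers
listings$reviews <- as.numeric(gsub("[(),]", "", listings$reviews_raw))
```

Because every column comes from the same set of cards, the rows stay aligned without any manual index juggling.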