Melissa Bowman
2022-12-04
For this project, the trading card website called TCG player was scraped for the price guide information from the Pokémon card set called Sword and Shield: Silver Tempest. This price guide was then used on a personal collection of all cards recorded in a csv file to determine how much capital was gained from the purchasing of this set’s cards. After gaining the price total from the personal collection, an exploration of the relationship between card types and price was conducted using linear regression. This was to conclude if there was a relationship between the cards type and if the type that was rarer would dictate collectability pricing.
Load libraries.
Collecting and gathering was the most challenging part of this project. At first, I was using a part of the website where you could not web scrape directly because of the embedded JS. I learned that when websites use this feature and a web page is fetched outside of the browser, the html is not returned just the unexecuted JS. The work around I found was to use a proxy API called ScraperAPI. This website makes it so that you can just imbed their website and an API key with the webpage you are trying to scrape and just web scrape like normal by reading the html link. However, even though I thought I had a solid solution to acquire the data there were still issues with data collection and I’d come to find a much easier solution. On the TCG player website, I discovered that there was a price guide section in which you could directly scrape website card information. So, in the end, I no longer needed ScraperAPI to gather the data.
Once the html was captured from web scraping, the information needed to be parsed and extracted to create a data frame which listed price, name of cards, their card type, and the unique id of the cards. The combination of a Chrome extension to isolate the CSS selectors called SelectorGadget and the html_nodes function in the rvest package was used to call on the information needed. From there, lists were generated from the four elements needed and a data frame was created from those lists.
Create First Dataframe
name = page %>% html_nodes(".product .cellWrapper") %>% html_text()
price = page %>% html_nodes(".marketPrice .cellWrapper") %>% html_text()
id = page %>% html_nodes(".number .cellWrapper") %>% html_text()
rarity = page %>% html_nodes(".rarity .cellWrapper") %>% html_text()
silver_tempest_price_guide = data.frame(name, price, id, rarity, stringsAsFactors = FALSE)Create Second Dataframe
name = page_trainer %>% html_nodes(".product .cellWrapper") %>% html_text()
price = page_trainer %>% html_nodes(".marketPrice .cellWrapper") %>% html_text()
id = page_trainer %>% html_nodes(".number .cellWrapper") %>% html_text()
rarity = page_trainer %>% html_nodes(".rarity .cellWrapper") %>% html_text()
silver_tempest_trainer_price_guide = data.frame(name, price, id, rarity, stringsAsFactors = FALSE)There were two separate listings of the Silver Tempest set that needed to be combined. Those were merged to create one completed data frame.
Tidying of silver_tempest_full Data Frame before Statistical Analysis.
silver_tempest_full <- silver_tempest_full %>%
#removing dollar signs in price
mutate(price = str_remove(price,"\\$")) %>%
#remove all trailing whitespace of coulmns
mutate(name = str_trim(name ,"both")) %>%
mutate(price = str_trim(price ,"both")) %>%
mutate(id = str_trim(id ,"both")) %>%
mutate(rarity = str_trim(rarity ,"both")) %>%
#remove all whitespace in id
mutate(id = str_remove_all(id," ")) %>%
#lowercase of card rarity
mutate(rarity = str_to_lower(rarity))
# converting character type column to numeric
silver_tempest_full <- transform(silver_tempest_full,price = as.numeric(price))
# Remove blank values in id numbers
silver_tempest_full <- silver_tempest_full[!(silver_tempest_full$id ==""),]
## name price id rarity
## 1 Alolan Vulpix V 1.21 033/195 ultra rare
## 2 Alolan Vulpix V (Full Art) 10.17 173/195 ultra rare
## 3 Alolan Vulpix VSTAR 2.44 034/195 ultra rare
## 4 Alolan Vulpix VSTAR (Secret) 15.96 197/195 secret rare
## 5 Altaria 0.06 143/195 uncommon
## 6 Amoonguss 0.07 012/195 rare
After initiating the data from the website, the personal card collection data frame was uploaded. These 3 data frames were created by recording the pack number, unique id, and card type of a sample size of approximately 660 cards. These 3 data frames were then merged to create one completed data frame.
df_box = read.csv('https://raw.githubusercontent.com/melbow2424/Data-606-Final-Project/main/sword_shield_silver_tempest_booster_box_sample%20-%20Sheet1.csv')
df_confirm = read.csv('https://raw.githubusercontent.com/melbow2424/Data-606-Final-Project/main/sword_shield_silver_tempest_confirm_sample%20-%20Sheet1.csv')
df_packs = read.csv('https://raw.githubusercontent.com/melbow2424/Data-606-Final-Project/main/sword_shield_silver_tempest_booster_packs%20-%20Sheet1.csv')
#Removed a column from the data frame of the booster packs
df_packs <- subset(df_packs, select = -c(X))
# Merging the two data frame together to get full booster box sample
df_full_collection = rbind(df_box, df_confirm, df_packs)df_full_collection <- df_full_collection %>%
#lowercase of card rarity
mutate(card_rarity = str_to_lower(card_rarity)) %>%
#remove all trailing whitespace of card rarity
mutate(card_rarity = str_trim(card_rarity ,"both")) %>%
#replace holo v rare with holo rare v
mutate(card_rarity = str_replace_all(card_rarity ,"holo v rare", "holo rare v")) %>%
#remove all whitespace in id
mutate(id = str_remove_all(id," "))## pack_number id card_rarity
## 1 1 080/195 uncommon
## 2 1 155/195 uncommon
## 3 1 075/195 uncommon
## 4 1 037/195 common
## 5 1 106/195 common
## 6 1 054/195 common
Once I had the data from the web site with price information and the personal collection data, I needed to join the two data frames where the prices listed for the web site would be in the personal collection. This was done with the merge function by the id column of the two data frames.
#Merging data frames by id
df_price <- merge(x = df_full_collection, y = silver_tempest_full, by = "id")
#Removing an error card 177/195. Not in set
df_price <- subset(df_price,id != "177/195")
#Removing column rarity
df_price <- df_price %>% select(-rarity)## id pack_number card_rarity name price
## 1 001/195 46 common Venonat 0.09
## 2 001/195 35 common Venonat 0.09
## 3 001/195 3 common Venonat 0.09
## 4 002/195 44 uncommon Venomoth 0.05
## 5 003/195 32 common Spinarak 0.04
## 6 003/195 20 common Spinarak 0.04
From there, the personal collection card prices could be summarized in the total amount.Also separated card less than $0.50 because these cards are sold or traded together in a bulk set. Because the price is taken from the website directly, daily changes to the cards pricing are expected.
#Seperating card prices less then $0.50
df_price_over <- subset(df_price,price > 0.50)
df_price_under <- subset(df_price,price <= 0.50)## [1] 36.89
## [1] 356.49
## [1] 393.38
#Booster box cost
booster_box = 125.00
#Elite Trainer box cost (has 8 booster backs)
elite_trainer_box = 35.99
#Booster pack cost
booster_pack = 3.32
#Total spend on cards
card_spending = booster_box + elite_trainer_box + 22*booster_pack
#Profit made
print(sum(df_price$price)-card_spending)## [1] 159.35
The next thing explored in this project was to see if there is a relationship between card type (rarity) and the price of the cards by using linear regression. First thing done was to take card types and rank them by the number of times the cards were pulled in the personal collection.
## card_rarity n
## 1 common 347
## 2 holo rare 10
## 3 holo rare v 9
## 4 holo rare vstar 2
## 5 radiant rare 4
## 6 rare 56
## 7 secret rare 1
## 8 trainer gallery holo rare 5
## 9 trainer gallery holo rare v 4
## 10 ultra rare 3
## 11 uncommon 217
Common cards where ranked number one because they were most frequently pulled from packs. Radiant rare and trainer gallery holo rare v cards were ranked at the same number because the had the same number of times they showed in the sample size. Secret rare was ranked last because it was pulled only once out of 660 card samples. To show linearity better, I also took out an outlier card that was excessively priced from all other cards.
df_price_rank <- df_price %>%
#lowercase of card rarity
#mutate(card_rank = dense_rank(-desc(count(card_rarity))))
mutate(card_rank = case_when(card_rarity == "common" ~ 1,
card_rarity == "uncommon" ~ 2,
card_rarity == "rare" ~ 3,
card_rarity == "holo rare" ~ 4,
card_rarity == "holo rare v" ~ 5,
card_rarity == "trainer gallery holo rare" ~ 6,
card_rarity == "radiant rare" ~ 7,
card_rarity == "trainer gallery holo rare v" ~ 7,
card_rarity == "ultra rare" ~ 8,
card_rarity == "holo rare vstar" ~ 9,
card_rarity == "secret rare" ~ 10))
# Removing an card outlier.
df_price_rank <- subset(df_price_rank ,id != "186/195")ggplot(data = df_price_rank , aes(x = price, y = card_rank)) +
geom_jitter(shape = 1)+
geom_smooth(method = "lm", se = FALSE)##
## Call:
## lm(formula = card_rank ~ price, data = df_price_rank)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9153 -0.6877 -0.6838 0.3083 6.7867
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.67190 0.04171 40.08 <2e-16 ***
## price 0.39625 0.02536 15.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.057 on 655 degrees of freedom
## Multiple R-squared: 0.2716, Adjusted R-squared: 0.2704
## F-statistic: 244.2 on 1 and 655 DF, p-value: < 2.2e-16
ggplot(data = m_bty, aes(x = .fitted, y = .resid)) +
geom_jitter(shape = 1) +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")Plotting price verse card rank to see if any linear relationship. At a glance, it does not look like there is a linear relationship but just to double check I created a residual plot and a normal probability plot to verify. None of this show linearity, therefore it was concluded that there is no linear relationship between the card type and price. This was an anticipated outcome that there may not be a direct linear relationship between card type and price for these collectable cards. This is due to the fact that the most valuable card in the set is an ultra-rare card (not a secret rare card type) called Lugia V (Alternate Full Art) which is worth approximately $220.00.
Using web scraping to create a pricing guide can be difficult if, the
website used has embedded JavaScript in their HTML code, what is being
scraped is a dynamic webpage, or if the website has anti-scraping
technology. When web scraping becomes a complex exercise, it’s
importance to recognize here that this the sector of software
engineering known as data engineering.
As for this project,
there was a direct website to scrape from and wasn’t as complex as
initially suspected thus the data was easy to acquire. This made the
price guide a straightforward process with R coding knowledge. As for
the card type being related through price of cards, it was clear after
testing the relationship with linear regression that there was no
foundation for any relational commonality.