In a previous post on Levana 2, we explored the correlation between various variables and placing a bid in the Levana Meteor NFT Minting Event.
In this article we train a machine learning model that seeks to predict the outcome of the Levana NFT auction, and we run the model on a test dataset to see how it performs. The model predicts 111 bids with a total combined value of $1,400,945.00. In reality, 2261 bids were received for a total of $1,444,450.00.
The machine learning model is a decision tree that uses the following 24 variables.
For each address:
For addresses that placed bids: 1. Wallet balance (total_usd_balance) (link to query on Flipside Crypto).
Flipside Crypto’s API allows us to pull query results from the Velocity platform.
The following function is used to import our data:
library(data.table)  # provides data.table() and setnames()

# Pull the latest results of a Flipside Velocity query and return them as a data.table
PullVelocityData <- function(endpoint.url){
  raw.results <- readLines(endpoint.url)
  to.return <- data.table(jsonlite::fromJSON(raw.results))
  setnames(to.return, tolower(names(to.return)))  # normalize column names to lower case
  return(to.return)
}
We import the query results from Velocity. Some data was also downloaded from the LUNAtic rankings website.
total_bid_price_and_wallet_balances <- PullVelocityData("https://api.flipsidecrypto.com/api/v2/queries/b590b2d5-6e8d-47bb-8baa-032921043550/data/latest")
lunatic_scores <- read.csv("Lunatic Scores.csv")
Addresses_purchased_NFTs_Random_Earth <- PullVelocityData("https://api.flipsidecrypto.com/api/v2/queries/a086c115-6f35-469a-80af-2e63a1d2f42f/data/latest")
punk_owners <- PullVelocityData("https://api.flipsidecrypto.com/api/v2/queries/2e45401d-34c7-44cb-80d2-2b0773d0b540/data/latest")
#Nov_07_Wallet_Balance_of_Current_Punk_Owners <- PullVelocityData("https://api.flipsidecrypto.com/api/v2/queries/09fe61e8-0f09-44ed-8b1a-34df073d7107/data/latest")
#Nov_7_wallet_balances_of_degeneracy_addresses <- PullVelocityData("https://api.flipsidecrypto.com/api/v2/queries/f46cb189-32a4-4426-a020-6739110985e2/data/latest")
levana.data5 <- lunatic_scores %>% mutate(placed_a_bid = ifelse(address %in% total_bid_price_and_wallet_balances$address, 1, 0))
levana.data6 <- levana.data5 %>% left_join(Addresses_purchased_NFTs_Random_Earth)
#replacing NA with 0 (purchases)
levana.data6 <- levana.data6 %>% mutate(nb_purchases = coalesce(nb_purchases, 0))
levana.data7 <- levana.data6 %>% left_join(punk_owners)
#replacing NA with 0 (nb punks owned = count(tokenid))
levana.data7 <- levana.data7 %>% mutate(`count(tokenid)` = coalesce(`count(tokenid)`, 0))
#adding nov 07 wallet balance of addresses with a degeneracy score
#levana.data8 <- levana.data7 %>% left_join(Nov_7_wallet_balances_of_degeneracy_addresses)
#total_bid_price <- subset(total_bid_price_and_wallet_balances, select = -c(total_usd_balance))
levana.data9 <- levana.data7 %>% left_join(total_bid_price_and_wallet_balances)
#replacing NA with 0 (purchases)
levana.data9 <- levana.data9 %>% mutate(total_ust_bid = coalesce(total_ust_bid, 0))
The resulting dataset has 746 701 unique addresses. It records which addresses submitted a bid in the Levana NFT auction, along with the total amount each address bid during the auction.
head(levana.data9, 3)
## X address Total.Score Activity Airdrops
## 1 1 terra100023m27dq557redx2pt7ugnvy5h3mmac8ktlw 9 6 0
## 2 2 terra10002xprwsf3fetzkvejtlueljhglzdtn24jtx4 1 0 0
## 3 3 terra10003hmrw9pxyla6wjejpe3wf6xc272cvekhpxq 1 0 0
## Cash.Out.vs.HODL Degeneracy Governance days_since_last_txn
## 1 0 3 0 2
## 2 1 0 0 0
## 3 1 0 0 0
## n_airdrop_and_gov_stakes n_airdrops_claimed n_contracts n_dex_trades
## 1 0 0 3 1
## 2 0 0 0 0
## 3 0 0 0 0
## n_governance_votes n_lp_deposits n_projects_staked n_protocols_claimed
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## n_tokens_used n_transactions net_from_shuttle_cex prop_drops_kept
## 1 2 1 0 0
## 2 0 0 1 0
## 3 0 0 1 0
## prop_luna_staked repeat_protocol_claims placed_a_bid nb_purchases
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## count(tokenid) total_usd_balance total_ust_bid
## 1 0 NA 0
## 2 0 NA 0
## 3 0 NA 0
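Before moving on, a couple of quick sanity checks (a minimal sketch using the objects built above) confirm the shape of the data:
nrow(levana.data9)                    # 746 701 rows, one per address
length(unique(levana.data9$address))  # should match the row count
count(levana.data9, placed_a_bid)     # how many addresses placed a bid (1) vs not (0)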
Exploring the dataset
a1 <- ggplot(total_bid_price_and_wallet_balances, aes(x = total_usd_balance, y = total_ust_bid))
plot(a1 + geom_point(color = "dark green") +
  labs(subtitle = "Total amount bid by each address vs wallet balance") +
  stat_cor(method = "pearson"))
There is a 0.35 correlation between an address's wallet balance and the total amount it bid. Note that this is computed from a list of addresses that are known to have bid.
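The same figure can be computed directly (a quick check using the bidders table imported earlier):
# Pearson correlation between wallet balance and total amount bid, over known bidders
with(total_bid_price_and_wallet_balances,
     cor(total_usd_balance, total_ust_bid, use = "complete.obs"))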
Our first model attempts to predict the total number of bids received from our list of addresses. In other words, it tries to predict who will place a bid at the auction.
We partition our data into a training set and a validation set. This is an essential step in the creation of machine learning models. We explain how each partition is used later on.
set.seed(999)
levana.data9_no_address <- subset(levana.data9, select = -c(X, address))
levana_partition <- createDataPartition(y = levana.data9_no_address$placed_a_bid, p = 0.6, list = FALSE, times = 1)
entrainement <- levana.data9_no_address %>% slice(levana_partition)
entrainement_mutate_placed_a_bid <- entrainement %>% mutate(placed_a_bid = as.integer(placed_a_bid))  # includes placed_a_bid and total_ust_bid
test_data <- levana.data9_no_address %>% slice(-levana_partition)  # includes placed_a_bid and total_ust_bid
test_data <- test_data %>% mutate(placed_a_bid = as.integer(placed_a_bid))
entrainement_X <- select(entrainement_mutate_placed_a_bid, 1:25)  # includes placed_a_bid and wallet balance, but not total_ust_bid
entrainement_X_no_outcome <- select(entrainement_mutate_placed_a_bid, 1:21, 23, 24, 25)  # excludes placed_a_bid and total_ust_bid; includes wallet balance
entrainement_Y_outcome <- select(entrainement, 22)  # placed_a_bid
test_X <- select(test_data, 1:25)  # includes placed_a_bid and wallet balance, but not total_ust_bid
test_X <- test_X %>% mutate(placed_a_bid = as.integer(placed_a_bid))
test_X_no_outcome <- select(test_data, 1:21, 23, 24, 25)  # excludes placed_a_bid and total_ust_bid; includes wallet balance
test_Y_outcome <- select(test_data, 22)  # placed_a_bid
test_Y_outcome <- test_Y_outcome %>% mutate(placed_a_bid = as.integer(placed_a_bid))
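createDataPartition samples within groups of the outcome, so the share of bidders should be roughly similar in both partitions; a quick check (a sketch using the objects above):
# Proportion of addresses that placed a bid in each partition
mean(entrainement$placed_a_bid)
mean(test_data$placed_a_bid)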
Let’s use machine learning to see if we can estimate the number of bids received. The prediction will be made using our 24 variables of interest.
First, we train a model on our training dataset. This dataset includes the outcome variable (placed_a_bid), so the model is told which addresses placed a bid; we train it on that outcome together with the 24 input variables.
This is known as a classification problem: our outcome variable (placed_a_bid) is either True (1) or False (0).
# Classification tree: 30 cross-validations, a deep tree allowed (maxdepth = 30),
# small minimum node size (minsplit = 5) and a low complexity penalty (cp = 0.001)
fit <- rpart(as.factor(placed_a_bid) ~ ., data = entrainement_X, method = 'class',
             xval = 30, maxdepth = 30, minsplit = 5, maxsurrogate = 10, cp = 0.001)
# takes about 1 minute to run
The model is ready. Now comes the crucial step. We feed it a new dataset without the outcome variable. The model has never seen this dataset and does not know which addresses placed a bid. It will try to predict how many bids were received from the addresses in this dataset.
predict_matrix_bids <- predict(fit, test_X_no_outcome, type = "class")
Let’s see how many bids our model expects.
summary(predict_matrix_bids)
## 0 1
## 298569 111
It predicts 111 bids.
Let’s compare this prediction against the real-world outcome.
count(test_X, placed_a_bid)
## placed_a_bid n
## 1 0 296419
## 2 1 2261
In reality, 2261 bids were received from these addresses. Our model was not accurate, but let's not give up too soon.
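For a fuller picture than the two counts above, we can cross-tabulate predicted against actual labels (a quick sketch using the vectors already defined):
# Rows: predicted class; columns: actual class
table(predicted = predict_matrix_bids, actual = test_X$placed_a_bid)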
Our decision tree performed poorly on the classification problem. Maybe it will work better on another type of problem. This time we will try to predict the total amount received from a list of known bidders: we will tell the model which addresses placed a bid, and it will try to predict the amounts that were bid. Our outcome variable (total_ust_bid) is a continuous variable, so this type of problem is known as a regression.
Again, we begin with a training dataset that includes the outcome variable (total_ust_bid). Recall that the full dataset of 746 701 observations was previously split into two parts; the first part, a training dataset containing 60% of all observations, is the one we use now to train the model.
# Regression tree (method = 'anova') on the continuous outcome total_ust_bid,
# with the same control parameters as the classification tree
fit <- rpart(total_ust_bid ~ ., data = entrainement_mutate_placed_a_bid, method = 'anova',
             xval = 30, maxdepth = 30, minsplit = 5, maxsurrogate = 10, cp = 0.001)
The model we just built is called a decision tree. It can be visualized in Appendix 1.
The amount bid by an actual person seems like something that would be hard to predict. Can a decision tree make an accurate prediction about that?
Let's see how the decision tree performs. We feed it our test dataset without the outcome variable (total_ust_bid). The model will use the other 24 input variables to predict how much each address will bid. It should be noted that the model knows which addresses placed a bid (but not how much each bid).
predict_matrix <- predict(fit, test_X, type = "matrix")
sum(predict_matrix)
## [1] 1400945
The model predicts a total combined bid amount of $1,400,945.00.
How much did these addresses actually bid?
sum(test_data$total_ust_bid)
## [1] 1444450
$1,444,450.00
This is very close to our model’s prediction.
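In relative terms (simple arithmetic on the two totals above):
# Relative gap between predicted and actual totals: roughly -3%
(sum(predict_matrix) - sum(test_data$total_ust_bid)) / sum(test_data$total_ust_bid)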
Let’s produce some stats on accuracy.
postResample(pred = predict_matrix, obs = test_data$total_ust_bid)
## RMSE Rsquared MAE
## 294.30196066 0.04245073 6.33341786
Let’s ignore Rsquared for now.
Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) are metrics used to evaluate a regression model. They tell us how accurate our predictions are and how far they deviate from the actual values.
R-squared measures how well the 24 input variables explain the variability in the outcome variable. It ranges from 0 to 1, with values closer to 1 being preferable.
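All three metrics can be reproduced by hand (a sketch matching postResample's definitions, in which R-squared is the squared correlation between predictions and observations):
errors <- predict_matrix - test_data$total_ust_bid
sqrt(mean(errors^2))                            # RMSE
mean(abs(errors))                               # MAE
cor(predict_matrix, test_data$total_ust_bid)^2  # R-squared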
All three of our accuracy scores are very poor, which means there is a gap between predicted and actual data points: the model made inaccurate predictions about how much each individual address bid. Overall, though, the predicted total of all bids combined was very close to reality.
Importance of each variable
varimp3 <- varImp(fit)
varimp3
## Overall
## Activity 0.16742650
## Airdrops 0.21363997
## Cash.Out.vs.HODL 1.93639317
## count(tokenid) 1.68415217
## Degeneracy 1.56755272
## Governance 2.73582890
## n_airdrop_and_gov_stakes 0.69239172
## n_airdrops_claimed 0.21363997
## n_contracts 0.11078704
## n_governance_votes 1.73801885
## n_lp_deposits 1.40235767
## n_projects_staked 0.04644551
## n_protocols_claimed 0.20188006
## n_tokens_used 0.00916158
## nb_purchases 3.19593099
## net_from_shuttle_cex 0.41644006
## placed_a_bid 0.10750935
## prop_drops_kept 1.56964817
## prop_luna_staked 0.64764901
## Total.Score 5.99081677
## total_usd_balance 9.84972508
## days_since_last_txn 0.00000000
## n_dex_trades 0.00000000
## n_transactions 0.00000000
## repeat_protocol_claims 0.00000000
## `count(tokenid)` 0.00000000
The most heavily weighted variables were total_usd_balance, Total.Score, nb_purchases and Governance. These variables had the largest impact on the prediction.
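Sorting the importance table makes this ranking explicit (a one-line check on the varimp3 object above):
# Top five variables, in decreasing order of importance
head(varimp3[order(-varimp3$Overall), , drop = FALSE], 5)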
Plotting Predicted vs Real Bids
test_fitted_real <- tibble(real_y = test_data$total_ust_bid,
                           pred_y = predict_matrix,
                           model = 'decision_tree_regression')
ggplot(data = test_fitted_real,
       aes(x = real_y, y = pred_y, color = model)) +
  geom_point() +
  geom_point(aes(x = real_y, y = real_y, color = 'ideal')) +
  coord_fixed()
This graph shows bid amounts for each individual address: real bids are in turquoise, predicted bids in red. Accuracy is quite poor for individual bids. For example, the two outlier bids at the top right of the chart are large bids the decision tree failed to predict. The model is inaccurate for the bid amounts of individual addresses; on aggregate, however, it is accurate once all the bids are added together.
Based on a dataset of 746 701 addresses and 24 input variables, we trained two machine learning models called decision trees. The objective was to make predictions about the Levana Meteor NFT minting auction.
The first model failed to predict the total number of bids received (111 vs 2261). It may be that a decision tree was not the right type of model for this classification problem. Additionally, we suspect that a better prediction could be made by integrating additional variables. For example, the search history of each address may be quite useful in predicting who will bid.
The second model was remarkably accurate in predicting the sum of all bids received ($1,400,945.00 vs $1,444,450.00). However, the prediction on bid amounts for individual addresses was inaccurate.
Why is this interesting?
NFT creators are considering various ways to auction their work. Some are looking for ways to ensure the sale is perceived as fair. Others want the auction event to be available to loyal users of certain blockchains. Some NFT creators might simply want to maximize their profit.
Whitelists are often used to grant potential bidders exclusive access to an NFT auction. How should NFT creators choose who to let onto these exclusive whitelists? The information that we tried to predict should be of great interest to NFT auctioneers: when preparing a whitelist of exclusive bidders, knowing which addresses will actually place a bid, as well as the total amount the auction can be expected to raise, is crucial.
The input variables that we used are from publicly available data.
How would the decision tree perform in a completely new auction? Auctions have different characteristics, and the broader crypto macro environment will change with time. Interest in NFT auctions will also change with time.
Some of the input variables were queried after the Levana auction. Because the purpose is to make a prediction, it would be best to use input variables dated from before the auction took place.
According to the varImp function, some input variables were not assigned any importance by the decision tree. For example, the number of Galactic Punks owned by an address was assigned zero importance. On the other hand, when visualizing the full decision tree, the number of punks owned does appear in some decision nodes. This is puzzling.
Feedback on this paper would be much appreciated.
Thank you for taking the time to read this paper. Feedback can be submitted to Zook#2707 or @streamust.
Special thanks to:
Decision trees are a mainstream data mining technique and a form of supervised machine learning. A decision tree is like a flowchart with the structure of a tree; the branch at the end of the tree displays the prediction. (Source: upgrad.com)
Shown below is the decision tree that we built for predicting the total amount bid.
par(mar = rep(0.2, 4))
plot(fit, branch = 0.8, uniform = TRUE, compress = TRUE)
text(fit, all = TRUE, use.n = TRUE)
The decision tree is too large to be visualized easily. Let’s zoom in on specific parts of the tree to understand how it works.
Decision Tree, First Example
The top of the tree is called the root node. It contains all 448 021 (n) addresses in the training set. The root node splits addresses on their wallet balance: addresses with less than $6,927,000 are sent to the left branch, while addresses with a higher balance are sent to the right.
The next node on the left splits addresses again. Addresses which did not place a bid (n = 448 016) are sent to a terminal node on the left, which predicts that these addresses bid $0. Obviously this is correct, since these addresses did not place a bid.
Let’s look at another example.
Decision Tree, Second Example
Again, we begin at the root node. Addresses with a wallet balance under $6,927,000 are sent to the left. The second node sends addresses that placed a bid to the right. The third node separates addresses on their wallet balance again: addresses with balances greater than $143,100 are sent to the right. The fourth node separates addresses based on how many Galactic Punks NFTs they possess; ten addresses with more than five punks are sent to the right. Finally, a last split separates out addresses that currently stake over 75% of their LUNA balance. Two addresses meet all the criteria above; the predicted bid amount for these addresses is $17,420.
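Rather than zooming into the plot, the same routing can be read as text rules (a sketch; rpart.plot is an extra package not used elsewhere in this post):
library(rpart.plot)
# One row per terminal node: the predicted bid amount and the conditions leading to it
# (the output is long for a tree this deep)
rpart.rules(fit, cover = TRUE)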