An NFT is a digital asset that represents real-world objects such as art, music, in-game items, and videos. NFTs are bought and sold online, frequently with cryptocurrency, and they are generally encoded with the same underlying software as many cryptocurrencies. Although they’ve been around since 2014, NFTs are gaining notoriety now because they are becoming an increasingly popular way to buy and sell digital artwork. A staggering $174 million has been spent on NFTs since November 2017.1 The purpose of this project is to perform exploratory data analysis and determine which variables are most positively correlated with an increase in sales. This will help future NFT creators determine what they should focus on when listing their pieces.
The data set used in this project was pulled on January 16, 2022, and represents all-time information for the top NFT collections. As an example, the Sales column represents all sales under a specified NFT collection from its creation up until January 16, 2022.
The data set used consists of the following information:
From this, I would like to look more closely at how owners and assets relate to sales, to determine whether an increase in owners alone leads to more sales, whether an increase in assets alone leads to more sales, or whether an increase in both leads to more total sales.
First, I will plot a linear regression model that relates sales to owners.
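For reference, fitting a simple linear regression amounts to choosing the intercept and slope that minimize the squared residuals. The following Python sketch shows the closed-form least-squares fit. The function name `fit_simple_ols` and the small owners/sales lists are hypothetical illustrations, not values from the data set.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x about its mean, and cross-products of x and y
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy values, for illustration only
owners = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [35.0, 60.0, 90.0, 110.0, 140.0]

intercept, slope = fit_simple_ols(owners, sales)
print(f"sales = {intercept:.2f} + {slope:.2f} * owners")
```

A positive slope corresponds to the positively correlated relationship discussed in this section.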
As expected, the linear model shows a positive relationship between the two variables, suggesting that as the number of owners increases, so do the total sales of a collection. Next, I will plot the linear regression model that relates sales to assets.
Once more, this shows a positive relationship between the two variables, suggesting that as the number of assets in a collection increases, so do the total sales of the collection. The plots alone do not settle which of the two simple models explains more of the variance in the response variable; the \(R_{Adj}^2\) values reported later (0.6579 for owners versus 0.5674 for assets) indicate that owners is the stronger single predictor. I will cover how I fully determine this later.
I will also plot a basic multi-linear regression model relating sales to owners and assets. This will determine whether using both predictor variables improves or degrades the coefficient of determination.
When viewing the multi-linear regression plane, I expected it to show that sales increase with both predictor variables (owners and assets), and that is exactly what this initial model shows.
(Intercept) Assets Owners
13.4445201 0.1637544 2.4555063
\(b_1\): All else held constant, for each additional asset, we expect sales to increase, on average, by about 0.16 total sales.
\(b_2\): All else held constant, each additional owner is associated with an average increase of about 2.46 total sales.
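The interpretation of the two slopes can be checked numerically. The Python sketch below plugs the fitted coefficients from the output above into the regression equation; the function name `predict_sales` and the example inputs (100 assets, 50 owners) are hypothetical choices made only for illustration.

```python
# Coefficients copied from the fitted multi-linear model output
INTERCEPT = 13.4445201
B_ASSETS = 0.1637544
B_OWNERS = 2.4555063


def predict_sales(assets, owners):
    """Fitted sales for a given number of assets and owners."""
    return INTERCEPT + B_ASSETS * assets + B_OWNERS * owners


base = predict_sales(100, 50)        # hypothetical collection
plus_asset = predict_sales(101, 50)  # one more asset, owners held constant
plus_owner = predict_sales(100, 51)  # one more owner, assets held constant

print(f"effect of one more asset: {plus_asset - base:.4f}")
print(f"effect of one more owner: {plus_owner - base:.4f}")
```

The two differences recover \(b_1\) and \(b_2\), which is what "all else held constant" means in a linear model.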
The \(R_{Adj}^2\) indicates whether additional input variables are contributing to the model.2 I will now compare the \(R_{Adj}^2\) for the three models used above to determine the best-fitting model, and will continue with that model going forward.
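For reference, a common definition of the adjusted coefficient of determination, assuming \(n\) observations and \(k\) predictor variables (symbols introduced here, not taken from the original output), is:

\[ R_{Adj}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} \]

Unlike \(R^2\), this quantity can decrease when a predictor that adds little explanatory power is included, which is what makes it suitable for comparing models with different numbers of predictors.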
Multi-Linear Regression’s \(R_{Adj}^2\)
[1] 0.6588068
Simple-Linear Regression (Sales vs Assets) \(R_{Adj}^2\)
[1] 0.5674074
Simple-Linear Regression (Sales vs Owners) \(R_{Adj}^2\)
[1] 0.6579407
This tells me that the Multi-Linear regression model explains slightly more of the variance in the response variable than either simple model (0.6588 versus 0.6579 and 0.5674). As a result, I will continue to use the Multi-Linear regression going forward.
When reviewing the ANOVA results above, the p-value (1.084013e-138) is far below the significance level (0.05). As a result, I can reject the null hypothesis. This suggests that there is in fact an effect that exists in the data. Next I will perform marginal tests on \(\beta_1\) and \(\beta_2\).
\(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \not= 0\)
\(H_0: \beta_2 = 0\) versus \(H_1: \beta_2 \not= 0\)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.4445201 11.9573068 1.124377 2.613112e-01
Assets 0.1637544 0.1036157 1.580401 1.145520e-01
Owners 2.4555063 0.1947035 12.611513 1.890403e-32
Each p-value is compared with the significance level (0.05). The p-value for assets, or \(\beta_1\), (1.145520e-01) is greater than 0.05, so I fail to reject \(H_0: \beta_1 = 0\); with owners already in the model, assets is not a statistically significant predictor at this level. The p-value for owners, or \(\beta_2\), (1.890403e-32) is less than 0.05, so I reject \(H_0: \beta_2 = 0\) and conclude that \(\beta_2 \not= 0\). The significant ANOVA result shown above is therefore driven mainly by owners.
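The usual decision rule for a marginal t-test compares the p-value with the significance level \(\alpha\). The Python sketch below applies that rule to the p-values in the coefficient table above, assuming \(\alpha = 0.05\) as in the ANOVA step; the helper name `reject_null` is hypothetical.

```python
ALPHA = 0.05  # significance level, assumed to match the ANOVA step

# p-values copied from the Pr(>|t|) column of the coefficient table
p_values = {"Assets": 1.145520e-01, "Owners": 1.890403e-32}


def reject_null(p_value, alpha=ALPHA):
    """Return True when the null hypothesis (coefficient = 0) is rejected."""
    return p_value < alpha


decisions = {name: reject_null(p) for name, p in p_values.items()}
for name, rejected in decisions.items():
    print(name, "reject H0" if rejected else "fail to reject H0")
```

Under this rule only the owners coefficient is significant at the 5% level in the initial model.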
\(\beta_1\) Confidence Interval
2.5 % 97.5 %
-0.03974689 0.36725562
\(\beta_2\) Confidence Interval
2.5 % 97.5 %
2.073109 2.837904
These confidence intervals show that, given the model, I am 95% confident that \(\beta_1\) lies somewhere between -0.04 and 0.37 and that \(\beta_2\) lies somewhere between 2.07 and 2.84.
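These intervals can be approximated from the estimates and standard errors in the coefficient table. The Python sketch below uses a normal critical value in place of the exact t quantile, which is an assumption that is reasonable only for large samples, so the bounds differ slightly from the reported ones. The function name `approx_conf_int` is hypothetical.

```python
from statistics import NormalDist


def approx_conf_int(estimate, std_error, level=0.95):
    """Approximate two-sided confidence interval: estimate +/- z * std_error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return estimate - z * std_error, estimate + z * std_error


# Estimates and standard errors copied from the coefficient table
assets_ci = approx_conf_int(0.1637544, 0.1036157)
owners_ci = approx_conf_int(2.4555063, 0.1947035)

print("beta_1 (assets):", assets_ci)
print("beta_2 (owners):", owners_ci)
```

The interval for \(\beta_1\) contains zero, while the interval for \(\beta_2\) does not.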
I will use Cook’s Distance \(D_i\) to measure influence. It measures the squared distance that the vector of fitted values moves when the \(i\)th observation is deleted.
8 20 28
3.934975 1.530798 1.645267
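For reference, Cook’s Distance is commonly written as follows, where \(\hat{y}\) is the vector of fitted values, \(\hat{y}_{(i)}\) is the same vector with the \(i\)th observation deleted, \(p\) is the number of model parameters, \(MS_{Res}\) is the residual mean square, \(r_i\) is the studentized residual, and \(h_{ii}\) is the leverage of the \(i\)th observation (these symbols are introduced here and do not appear in the original output):

\[ D_i = \frac{(\hat{y}_{(i)} - \hat{y})^{\top}(\hat{y}_{(i)} - \hat{y})}{p \, MS_{Res}} = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}} \]

A common rule of thumb treats observations with \(D_i > 1\) as influential. The three values listed above all exceed 1, which is consistent with that cutoff, although the exact threshold used here is not stated.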
I will further pair this with the \(COVRATIO_i\) to determine whether the points found from Cook’s Distance \(D_i\) improve or degrade the precision.
6 9 10 16 28 35 44 48
1.034268 1.042111 1.015910 1.067555 1.555991 1.018704 1.095826 1.053328
53 55 90 103 107 243 244 246
1.017457 1.141492 1.050981 1.058377 1.198114 1.016917 1.038842 1.015512
349 363 376 586
1.039933 1.105400 1.015221 1.027835
First, the points above improve the precision, as \(COVRATIO_i > 1\). Point 28 is among them, so I will not consider removing this influential point.
2 5 7 8 11 12 15 20
0.7893084 0.9771029 0.9734095 0.5821241 0.8080388 0.9832904 0.9590974 0.5816751
25 84 95 109 131 136 419 449
0.9646836 0.9386441 0.9778603 0.9826148 0.9827753 0.9801984 0.9171562 0.8879161
573
0.9231860
Next I show the points that degrade the precision, as \(COVRATIO_i < 1\). Both 8 and 20 are among those points, so I will continue to consider removing them.
Given these observations, I will plot the R-Student residuals to detect the abnormality in the model.
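The \(COVRATIO_i\) screening can be sketched in Python. The cutoffs \(1 \pm 3p/n\) are a common rule of thumb and are an assumption here, since the thresholds actually used are not stated. The sample size `n_obs = 600` and the point labeled 100 are hypothetical placeholders; the other three values are copied from the output above.

```python
def covratio_flags(covratios, n_obs, n_params):
    """Split points into those that improve or degrade precision.

    Uses the rule-of-thumb cutoffs 1 + 3p/n and 1 - 3p/n (an assumption).
    """
    cutoff = 3 * n_params / n_obs
    improves = {i: c for i, c in covratios.items() if c > 1 + cutoff}
    degrades = {i: c for i, c in covratios.items() if c < 1 - cutoff}
    return improves, degrades


covratios = {
    8: 0.5821241,   # from the output above
    20: 0.5816751,  # from the output above
    28: 1.555991,   # from the output above
    100: 1.001,     # hypothetical point close to 1, for contrast
}

# n_obs is a hypothetical placeholder; n_params counts the intercept and two slopes
improves, degrades = covratio_flags(covratios, n_obs=600, n_params=3)
print("improve precision:", sorted(improves))
print("degrade precision:", sorted(degrades))
```

Points flagged by Cook’s Distance that also fall in the degrading group are the candidates for removal.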
[1] 8 20
As expected, this suggests that the two points are in fact degrading the precision, so I will fit a new model without these two observations and test again.
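For reference, the R-Student (externally studentized) residual is commonly defined as follows, where \(e_i\) is the ordinary residual, \(h_{ii}\) is the leverage, \(n\) is the number of observations, \(p\) is the number of model parameters, and \(S_{(i)}^2\) is the residual variance estimated without the \(i\)th observation (symbols introduced here, not taken from the original output):

\[ t_i = \frac{e_i}{\sqrt{S_{(i)}^2 \, (1 - h_{ii})}}, \qquad S_{(i)}^2 = \frac{(n - p) \, MS_{Res} - e_i^2 / (1 - h_{ii})}{n - p - 1} \]

Because \(S_{(i)}^2\) excludes the observation being tested, an outlier cannot inflate its own variance estimate, which makes large values of \(|t_i|\) easier to detect.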
New Multi-Linear Regression’s \(R_{Adj}^2\)
[1] 0.6687642
As expected, with the removal of the degrading points the \(R_{Adj}^2\) is up from 0.6588068 to 0.6687642, an increase of about 0.0099574. This means that with the removal of the degrading points, the share of variance explained by the predictor variables has improved, making for a better-fitting model.
I’d like to go further and remove more outliers using the R-Student residuals to detect the abnormality in the new model.
2 11
2 10
I will do this and compare \(R_{Adj}^2\) values until I conclude that the model cannot be made any more precise.
[1] 0.7024546
11 449
9 445
[1] 0.7246209
84 449
78 443
[1] 0.7167459
Now that the \(R_{Adj}^2\) value has begun to decrease, I will use the previous model (\(R_{Adj}^2\) = 0.7246209) to test assumptions and hypotheses.
I will begin by re-performing an ANOVA test, comparing the null and full models. My hypotheses for this test will be:
When reviewing the ANOVA results above, the p-value (2.024551e-164) is far below the significance level (0.05). As a result, I can reject the null hypothesis. This again suggests that there is still in fact an effect that exists in the data.
Next I will again perform marginal tests on \(\beta_1\) and \(\beta_2\).
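The stopping rule used above can be sketched in Python: keep removing outliers while \(R_{Adj}^2\) improves, and fall back to the last model before it drops. The list holds the \(R_{Adj}^2\) values reported in this section, in order; the function name `last_improving_step` is hypothetical.

```python
# Adjusted R-squared after each round of outlier removal (step 0 = original model)
adj_r2_by_step = [0.6588068, 0.6687642, 0.7024546, 0.7246209, 0.7167459]


def last_improving_step(values):
    """Index of the last step before the metric stops increasing."""
    best = 0
    for step in range(1, len(values)):
        if values[step] <= values[step - 1]:
            break  # the metric got worse, so stop and keep the previous model
        best = step
    return best


chosen = last_improving_step(adj_r2_by_step)
print("chosen step:", chosen, "adjusted R^2:", adj_r2_by_step[chosen])
```

This is a greedy rule: it stops at the first decrease and does not check whether later removals would raise the value again.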
\(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \not= 0\)
\(H_0: \beta_2 = 0\) versus \(H_1: \beta_2 \not= 0\)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.0240464 7.86207910 2.546915 1.112382e-02
Assets 0.3127757 0.06861386 4.558492 6.280871e-06
Owners 1.6728212 0.13199551 12.673319 1.106081e-32
The p-value for assets, or \(\beta_1\), (6.280871e-06) is less than the significance level (0.05), so I reject \(H_0: \beta_1 = 0\) and conclude that \(\beta_1 \not= 0\). Likewise, the p-value for owners, or \(\beta_2\), (1.106081e-32) is less than 0.05, so I reject \(H_0: \beta_2 = 0\) and conclude that \(\beta_2 \not= 0\). This agrees with the ANOVA test shown above, and further suggests that there is still in fact an effect that exists in the data.
\(\beta_1\) Confidence Interval
2.5 % 97.5 %
0.1780153 0.4475362
\(\beta_2\) Confidence Interval
2.5 % 97.5 %
1.413577 1.932066
These confidence intervals show that, given the model, I am 95% confident that \(\beta_1\) lies somewhere between 0.1780153 and 0.4475362 and that \(\beta_2\) lies somewhere between 1.413577 and 1.932066.
I will once again plot the model now that I have made it more precise.
For this basic prediction I will check the fitted values for every combination of 1-300 assets and 1-300 owners, for a total of 90,000 predictions. I will then look at the top 900 (1%) and determine which numbers of owners and assets are most common in the top 1% of the predictions, to determine what numbers of owners and assets produce the highest predicted sales.
206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4
226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245
4 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 8 8 8
246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265
8 8 9 9 9 9 9 9 10 10 10 10 10 11 11 11 11 11 12 12
266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285
12 12 12 12 13 13 13 13 13 14 14 14 14 14 15 15 15 15 15 15
286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
16 16 16 16 16 17 17 17 17 17 18 18 18 18 18
After reviewing the output, the highest predicted sales occur with 300 assets and 300 owners. With assets = 300 there are 18 instances in the top 1% of cases, so it appears that the more assets a collection has, the more sales it would have. This makes sense, as each individual NFT would have its own number of sales, and because both fitted coefficients are positive, a linear model will always place its largest predictions at the upper end of the grid. It is still a bit surprising, as one would assume that once the number of assets in a collection rose tremendously, total sales would eventually begin to fall, since the collection may lose its rarity.
Overall, the results were not at all unexpected, but they do confirm that, for the data that was used, when the number of owners and assets in an NFT collection increases, so do the sales of that collection. This is not surprising, as an increase in owners would mean that more transactions occur. However, I found it interesting that sales appear to increase as assets increase. I would have assumed that a certain number of assets would make a collection seem common, but it appears, at least with this data, that the rarity of a collection is not affected by the number of assets in it.
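The grid prediction described above can be sketched in Python. This assumes the coefficients from the second coefficient table are the ones used for prediction; the names `predict_sales`, `assets_counts`, and `owners_counts` are hypothetical.

```python
from collections import Counter

# Coefficients copied from the refitted model's coefficient table
INTERCEPT = 20.0240464
B_ASSETS = 0.3127757
B_OWNERS = 1.6728212


def predict_sales(assets, owners):
    """Fitted sales for a given number of assets and owners."""
    return INTERCEPT + B_ASSETS * assets + B_OWNERS * owners


# Every combination of 1-300 assets and 1-300 owners (90,000 pairs)
grid = [(a, o) for a in range(1, 301) for o in range(1, 301)]

# Rank by predicted sales, highest first, and keep the top 1%
ranked = sorted(grid, key=lambda pair: predict_sales(*pair), reverse=True)
top = ranked[: len(grid) // 100]

# Tally how often each asset count and owner count appears in the top 1%
assets_counts = Counter(a for a, _ in top)
owners_counts = Counter(o for _, o in top)

print("best combination (assets, owners):", top[0])
print("top-1% cases with 300 assets:", assets_counts[300])
print("smallest asset count in top 1%:", min(assets_counts))
```

Because the owners coefficient is several times larger than the assets coefficient, the top 1% spans a wide range of asset counts but only a narrow band of owner counts near 300.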
CFI Team. “Adjusted R-Squared.” Corporate Finance Institute, May 5, 2022. https://corporatefinanceinstitute.com/resources/knowledge/other/adjusted-r-squared/.
Malikah, Nena. “Top NFT Collections.” Kaggle, January 17, 2022. https://www.kaggle.com/datasets/nenamalikah/nft-collections-by-sales-volume.