Abstract

An NFT (non-fungible token) is a digital asset that represents real-world objects like art, music, in-game items, and videos. NFTs are bought and sold online, frequently with cryptocurrency, and they are generally encoded with the same underlying software as many cryptocurrencies. Although they have been around since 2014, NFTs are gaining notoriety now because they are becoming an increasingly popular way to buy and sell digital artwork. A staggering $174 million has been spent on NFTs since November 2017.¹ The purpose of this project is to perform exploratory data analysis and determine which variables are most positively correlated with an increase in sales. This will help future NFT creators decide what to focus on when listing their pieces.

Introduction

The data set used in this project was pulled on January 16, 2022, and represents all-time information for the top NFT collections. As an example, the Sales column represents all sales under a specified NFT collection from its creation up until January 16, 2022.

The data set used consists of the following information:

Initial Exploratory Data

Correlation Heat Map

From this I would like to look more closely at the correlation of owners and assets in relation to sales, to determine if an increase in strictly owners leads to more sales, if an increase in strictly assets leads to more sales, or if an increase in both leads to more total sales.
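Each cell of a correlation heat map is a pairwise Pearson correlation coefficient. As a minimal sketch of that calculation, the Python below computes it for two columns. The function name `pearson` and the toy values are my own illustrations, not rows from the data set.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy values, not taken from the NFT data set.
owners = [10, 20, 30, 40]
sales = [25, 48, 80, 95]
print(round(pearson(owners, sales), 3))
```

A value near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little linear relationship.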

Simple Linear Regression Models

First, I will plot a linear regression model that relates sales to owners.

As expected, the linear model shows a positive relationship between the two variables, suggesting that as the number of owners increases, so do the total sales of a collection. Next I will plot the linear regression model that relates sales to assets.

Once more, this shows a positive relationship between the two variables, suggesting that as the number of assets in a collection increases, so do the total sales of the collection. Comparing the two regression lines by eye gives only a rough sense of which model explains more of the variance in the response variable. The \(R_{Adj}^2\) values reported later (0.567 for assets versus 0.658 for owners) indicate that the owners model explains more. I will cover how I fully determine this later.
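The simple linear regressions in this section fit a line of the form sales = b0 + b1 × predictor by ordinary least squares. The sketch below shows the closed-form fit and the resulting \(R^2\) in plain Python. The function names and the toy data are hypothetical and only illustrate the technique; they do not reproduce the fitted models.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def r_squared(x, y, b0, b1):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Hypothetical toy values, not taken from the NFT data set.
owners = [10, 20, 30, 40]
sales = [25, 48, 80, 95]
b0, b1 = fit_line(owners, sales)
print(b0, b1, r_squared(owners, sales, b0, b1))
```

A positive slope `b1` corresponds to the upward-sloping regression line described above.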

Multi-Linear Regression Model

I will also plot a basic multi-linear regression model relating sales to owners and assets. This will determine if both predictor variables improve or degrade the coefficient of determination.

When viewing the multi-linear regression plane, I expected it to show a positive relationship between sales and the two predictor variables (owners and assets), and that is exactly what this initial model shows.

Understanding the Models and determining Variance

\(\beta_{1}\) & \(\beta_{2}\) Coefficients

(Intercept)      Assets      Owners 
 13.4445201   0.1637544   2.4555063 

\(b_1\): All else held constant, for each additional asset we expect sales to increase, on average, by 0.16.

\(b_2\): All else held constant, for each additional owner we expect sales to increase, on average, by 2.46.
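The coefficient interpretations can be checked by evaluating the fitted equation directly. This is a small sketch using the coefficients from the output above. The function name `predict_sales` and the example inputs (100 assets, 50 owners) are my own choices.

```python
# Coefficients copied from the fitted model output above.
INTERCEPT = 13.4445201
B_ASSETS = 0.1637544
B_OWNERS = 2.4555063

def predict_sales(assets, owners):
    """Fitted sales for a collection with the given assets and owners."""
    return INTERCEPT + B_ASSETS * assets + B_OWNERS * owners

# Hypothetical collection used only to illustrate the marginal effects.
base = predict_sales(100, 50)
print(predict_sales(101, 50) - base)  # one more asset: about 0.16
print(predict_sales(100, 51) - base)  # one more owner: about 2.46
```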

\(R_{Adj}^2\) Values

The \(R_{Adj}^2\) indicates whether additional input variables are contributing to the model.² With that in mind, I will now compare the \(R_{Adj}^2\) of the three models used above to determine the best-fitting model, and will continue with that model going forward.
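For reference, the standard definition is \(R_{Adj}^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)\), where \(n\) is the number of observations and \(k\) the number of predictors. The sketch below shows how the penalty works. The sample size of 600 and the raw \(R^2\) of 0.66 are hypothetical values chosen for illustration, since neither is reported in this document.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k predictors and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical inputs: the same raw R-squared with one and with two predictors.
print(adjusted_r2(0.66, 600, 1))
print(adjusted_r2(0.66, 600, 2))
```

With the raw \(R^2\) held fixed, adding a predictor lowers the adjusted value, so a second predictor only raises \(R_{Adj}^2\) if it improves the fit by more than the penalty.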

Multi-Linear Regression’s \(R_{Adj}^2\)

[1] 0.6588068

Simple-Linear Regression (Sales vs Assets) \(R_{Adj}^2\)

[1] 0.5674074

Simple-Linear Regression (Sales vs Owners) \(R_{Adj}^2\)

[1] 0.6579407

This tells me that the multi-linear regression model explains slightly more of the variance in the response variable than either simple model (0.6588 versus 0.6579 for owners alone and 0.5674 for assets alone). As a result, I will continue to use the multi-linear regression going forward.

Hypothesis Testing of Multi-Linear Regression Model (ANOVA & Marginal Test)

For this first test I will be performing an ANOVA test, comparing the null and full models. My hypotheses for this test will be:
\(H_0: \beta_1 = \beta_2 = 0\)
\(H_1: \beta_1 \not= 0\) or \(\beta_2 \not= 0\)
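For reference, the overall test compares the variation explained by the regression with the residual variation. This is the standard textbook form of the statistic, with \(n\) observations and \(k = 2\) predictors, rather than output from the analysis itself:

```latex
F_0 = \frac{SS_R / k}{SS_{Res} / (n - k - 1)} = \frac{MS_R}{MS_{Res}}, \qquad k = 2
```

Under \(H_0\), \(F_0\) follows an \(F\) distribution with \(k\) and \(n - k - 1\) degrees of freedom, and \(H_0\) is rejected when the corresponding p-value falls below the significance level.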

When reviewing the ANOVA results above, it is clear that the p-value (1.084013e-138) is less than the significance level (0.05). As a result I can reject the null hypothesis. This suggests that at least one of the predictors has an effect on sales. Next I will perform marginal tests on \(\beta_1\) & \(\beta_2\). For each coefficient \(\beta_j\), \(j = 1, 2\), the hypotheses are:

\(H_0: \beta_j = 0\)
\(H_1: \beta_j \not= 0\)

              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 13.4445201 11.9573068  1.124377 2.613112e-01
Assets       0.1637544  0.1036157  1.580401 1.145520e-01
Owners       2.4555063  0.1947035 12.611513 1.890403e-32

Given that the p-value for assets, or \(\beta_1\), (1.145520e-01) is greater than the significance level (0.05), I fail to reject \(H_0\) and cannot conclude that \(\beta_1 \not= 0\) when owners is already in the model. The p-value for owners, or \(\beta_2\), (1.890403e-32) is less than the significance level, so I reject \(H_0\) and conclude that \(\beta_2 \not= 0\). This is consistent with the ANOVA test above, which only indicates that at least one coefficient is nonzero. In this initial model that effect is attributable to owners.
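Each marginal test uses the statistic t = estimate / standard error, and the decision rule compares the p-value with the significance level. The sketch below recomputes the t-values from the coefficient table above and applies that rule. The dictionary layout and the names `ALPHA`, `t_value`, and `reject_null` are my own.

```python
ALPHA = 0.05  # significance level used throughout this report

# (estimate, standard error, p-value) copied from the coefficient table above.
rows = {
    "Assets": (0.1637544, 0.1036157, 1.145520e-01),
    "Owners": (2.4555063, 0.1947035, 1.890403e-32),
}

def t_value(estimate, std_error):
    """Test statistic for H0: coefficient = 0."""
    return estimate / std_error

def reject_null(p_value, alpha=ALPHA):
    """Reject H0 only when the p-value is below the significance level."""
    return p_value < alpha

for name, (est, se, p) in rows.items():
    decision = "reject H0" if reject_null(p) else "fail to reject H0"
    print(name, round(t_value(est, se), 4), decision)
```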

Confidence intervals for \(\beta_1\) & \(\beta_2\)

\(\beta_1\) Confidence Interval

      2.5 %      97.5 % 
-0.03974689  0.36725562 

\(\beta_2\) Confidence Interval

   2.5 %   97.5 % 
2.073109 2.837904 

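These confidence intervals show that, given the model, I am 95% confident that \(\beta_1\) lies somewhere between -0.04 and 0.37 and that \(\beta_2\) lies somewhere between 2.07 and 2.84. Because the interval for \(\beta_1\) contains 0, a zero effect of assets cannot be ruled out at the 5% level in this model.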
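Each interval has the form estimate ± critical value × standard error. The degrees of freedom are not reported here, so this sketch backs the critical value out of the \(\beta_1\) interval and its standard error from the coefficient table, then uses it to rebuild the \(\beta_2\) interval. The function names are my own.

```python
def conf_interval(estimate, std_error, t_crit):
    """Two-sided interval: estimate plus or minus t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width

def implied_t_crit(lower, upper, std_error):
    """Critical value implied by a reported interval and its standard error."""
    return (upper - lower) / (2 * std_error)

# Interval and standard error for assets, copied from the outputs above.
t_crit = implied_t_crit(-0.03974689, 0.36725562, 0.1036157)
print(round(t_crit, 3))  # about 1.964

# Rebuild the owners interval from its estimate and standard error.
lower, upper = conf_interval(2.4555063, 0.1947035, t_crit)
print(round(lower, 4), round(upper, 4))
```

The rebuilt interval agrees with the reported \(\beta_2\) interval to rounding, which suggests both intervals use the same critical value.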

Regression Diagnostics

I will use Cook’s Distance, \(D_i\), to measure influence. It measures the squared distance that the vector of fitted values moves when the \(i\)th observation is deleted.

       8       20       28 
3.934975 1.530798 1.645267 
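For reference, the standard textbook form of Cook’s Distance is given below. Here \(p\) is the number of model parameters (3 in this model, counting the intercept), \(r_i\) is the internally studentized residual, and \(h_{ii}\) is the leverage of observation \(i\).

```latex
D_i = \frac{(\hat{y}_{(i)} - \hat{y})^\top (\hat{y}_{(i)} - \hat{y})}{p \, MS_{Res}}
    = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}
```

A common rule of thumb flags observations with \(D_i > 1\) as influential. I am assuming that cutoff was used here, and the three values listed above are consistent with it.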

I will further pair this with \(COVRATIO_i\) to determine whether the points flagged by Cook’s Distance \(D_i\) improve or degrade the precision of the estimates.

       6        9       10       16       28       35       44       48 
1.034268 1.042111 1.015910 1.067555 1.555991 1.018704 1.095826 1.053328 
      53       55       90      103      107      243      244      246 
1.017457 1.141492 1.050981 1.058377 1.198114 1.016917 1.038842 1.015512 
     349      363      376      586 
1.039933 1.105400 1.015221 1.027835 

First, the points above improve the precision, as \(COVRATIO_i > 1\). Observation 28 is among them, so I will not consider removing that influential point.

        2         5         7         8        11        12        15        20 
0.7893084 0.9771029 0.9734095 0.5821241 0.8080388 0.9832904 0.9590974 0.5816751 
       25        84        95       109       131       136       419       449 
0.9646836 0.9386441 0.9778603 0.9826148 0.9827753 0.9801984 0.9171562 0.8879161 
      573 
0.9231860 

Next are the points that degrade the precision, as \(COVRATIO_i < 1\). Both 8 and 20 are among them, so I will continue to consider removing those points.
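For reference, \(COVRATIO_i\) compares the generalized variance of the coefficient estimates with and without observation \(i\). The standard definition and the usual cutoffs are given below, where \(X_{(i)}\) and \(S_{(i)}^2\) are the design matrix and residual variance estimate with observation \(i\) deleted, \(p\) is the number of parameters, and \(n\) is the number of observations.

```latex
COVRATIO_i = \frac{\det\left[ (X_{(i)}^\top X_{(i)})^{-1} S_{(i)}^2 \right]}
                  {\det\left[ (X^\top X)^{-1} MS_{Res} \right]},
\qquad \text{cutoffs: } 1 \pm \frac{3p}{n}
```

Values above \(1 + 3p/n\) indicate that the observation improves precision, and values below \(1 - 3p/n\) indicate that it degrades precision. I am assuming the two lists above were produced with these cutoffs.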

Given these observations, I will plot the R-Student residuals to detect outliers in the model.

[1]  8 20

As expected this suggests that the two points are in fact degrading the precision, so I will make a new model that removes the two values and test again.
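For reference, the R-Student (externally studentized) residual scales each residual \(e_i\) by a variance estimate computed without observation \(i\). This is the standard definition, with \(p\) parameters and \(n\) observations:

```latex
t_i = \frac{e_i}{\sqrt{S_{(i)}^2 \, (1 - h_{ii})}},
\qquad
S_{(i)}^2 = \frac{(n - p) \, MS_{Res} - e_i^2 / (1 - h_{ii})}{n - p - 1}
```

Because \(S_{(i)}^2\) excludes observation \(i\), an outlier cannot mask itself by inflating the variance estimate, which makes \(t_i\) well suited to flagging points such as 8 and 20.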

Normalizing Model

New Multi-Linear Regression’s \(R_{Adj}^2\)

[1] 0.6687642

As expected, with the removal of the degrading points the \(R_{Adj}^2\) is up from 0.6588068 to 0.6687642, an increase of about 0.0099574. This means that, with those points removed, the predictor variables explain more of the variance, making for a better-fitting model.

I’d like to go further and remove more outliers, using the R-Student residuals to detect them in the new model.

 2 11 
 2 10 

I will repeat this and compare \(R_{Adj}^2\) values until removing further points no longer improves the model.

[1] 0.7024546

 11 449 
  9 445 
[1] 0.7246209

 84 449 
 78 443 
[1] 0.7167459

Now that the \(R_{Adj}^2\) value has begun to decrease (from 0.7246209 to 0.7167459), I will use the previous model to test assumptions and hypotheses.

I will begin by re-performing the ANOVA test, comparing the null and full models. My hypotheses for this test will be:
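The stopping rule used here can be written down explicitly: keep refitting while \(R_{Adj}^2\) rises, and keep the last model before it first falls. The sketch below applies that rule to the values reported above. The function name `best_step` is my own, and step 0 denotes the original model.

```python
# Adjusted R-squared after each refit, copied from the outputs above.
adj_r2_by_step = [0.6588068, 0.6687642, 0.7024546, 0.7246209, 0.7167459]

def best_step(values):
    """Index of the last model before adjusted R-squared first decreases."""
    for i in range(1, len(values)):
        if values[i] < values[i - 1]:
            return i - 1
    return len(values) - 1

step = best_step(adj_r2_by_step)
print(step, adj_r2_by_step[step])  # 3 0.7246209
```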
\(H_0: \beta_1 = \beta_2 = 0\)
\(H_1: \beta_1 \not= 0\) or \(\beta_2 \not= 0\)

When reviewing the ANOVA results above, it is obvious that the p-value (2.024551e-164) is less than the significance level (0.05). As a result I can reject the null hypothesis. This again suggests that there is still in fact an effect that exists in the data.

Next I will again perform marginal tests on \(\beta_1\) & \(\beta_2\). For each coefficient \(\beta_j\), \(j = 1, 2\), the hypotheses are:
\(H_0: \beta_j = 0\)
\(H_1: \beta_j \not= 0\)

              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 20.0240464 7.86207910  2.546915 1.112382e-02
Assets       0.3127757 0.06861386  4.558492 6.280871e-06
Owners       1.6728212 0.13199551 12.673319 1.106081e-32

Given that the p-value for assets, or \(\beta_1\), (6.280871e-06) is less than the significance level (0.05), I reject \(H_0\) and conclude that \(\beta_1 \not= 0\). Likewise, the p-value for owners, or \(\beta_2\), (1.106081e-32) is less than the significance level, so I also reject \(H_0\) and conclude that \(\beta_2 \not= 0\). This is consistent with the ANOVA test above, and in this model both predictors have a statistically significant effect on sales.

\(\beta_1\) Confidence Interval

    2.5 %    97.5 % 
0.1780153 0.4475362 

\(\beta_2\) Confidence Interval

   2.5 %   97.5 % 
1.413577 1.932066 

These confidence intervals show that, given the model, I am 95% confident that \(\beta_1\) lies somewhere between 0.1780153 and 0.4475362 and that \(\beta_2\) lies somewhere between 1.413577 and 1.932066. Neither interval contains 0.

Improved Multi-Linear Regression Model

I will once again plot the model now that I have made it more precise.

Basic Prediction of Sales

For this basic prediction I will be checking the fitted values for every combination of 1-300 assets and 1-300 owners, for a total of 90,000 predictions. I will then look at the top 900 (1%) and determine which numbers of owners and assets are most common in the top 1% of the predictions, to determine which numbers of owners and assets produce the highest predicted sales.


206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 
  1   1   1   1   1   2   2   2   2   2   3   3   3   3   3   3   4   4   4   4 
226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 
  4   5   5   5   5   5   6   6   6   6   6   6   7   7   7   7   7   8   8   8 
246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 
  8   8   9   9   9   9   9   9  10  10  10  10  10  11  11  11  11  11  12  12 
266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 
 12  12  12  12  13  13  13  13  13  14  14  14  14  14  15  15  15  15  15  15 
286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 
 16  16  16  16  16  17  17  17  17  17  18  18  18  18  18 

After reviewing the output, it appears that the highest predicted sales occur with 300 assets and 300 owners. Because both fitted coefficients are positive, a linear model necessarily predicts its largest value at the upper corner of the grid. With assets = 300 there are 18 instances in the top 1% of cases, so the model predicts that the more assets a collection has, the more sales it will have. That seems to make sense, as each individual NFT would have its own number of sales. It is still a bit surprising, as one would assume that once the number of assets in a collection rose tremendously, total sales would eventually fall, as the collection may lose its rarity with more owners.
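The grid prediction can be sketched in a few lines of Python. This sketch assumes that the refit coefficients from the second coefficient table were used and that the table above counts assets among the top 900 predictions. The function and variable names are my own.

```python
from collections import Counter

# Coefficients of the refit model, copied from the second coefficient table.
INTERCEPT, B_ASSETS, B_OWNERS = 20.0240464, 0.3127757, 1.6728212

def top_combinations(max_value=300, top_n=900):
    """Fitted sales for every (assets, owners) pair, highest first."""
    grid = [
        (INTERCEPT + B_ASSETS * a + B_OWNERS * o, a, o)
        for a in range(1, max_value + 1)
        for o in range(1, max_value + 1)
    ]
    grid.sort(reverse=True)
    return grid[:top_n]

top = top_combinations()
assets_counts = Counter(a for _, a, _ in top)
owners_counts = Counter(o for _, _, o in top)

print(top[0][1:])                             # (300, 300)
print(assets_counts[300], min(assets_counts)) # 18 206
print(owners_counts[300], min(owners_counts)) # 95 283
```

Under these assumptions the asset counts match the table above, running from 1 combination at 206 assets to 18 at 300 assets. The owner counts are concentrated in a much narrower range (283 to 300), reflecting the larger owners coefficient.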

Conclusion

Overall the results were not at all unexpected, but they do confirm that, for the data that was used, when the number of owners and assets in an NFT collection increases, so do the sales of that collection. This is not surprising, as an increase in owners would mean that more transactions occur. However, I found it interesting that sales also appear to increase as assets increase. I would have assumed that beyond a certain number of assets a collection would start to seem common, but it appears, at least with this data, that the rarity of a collection is not affected by the number of assets it contains.

Sources

CFI Team. “Adjusted R-Squared.” Corporate Finance Institute, May 5, 2022. https://corporatefinanceinstitute.com/resources/knowledge/other/adjusted-r-squared/.

Malikah, Nena. “Top NFT Collections.” Kaggle, January 17, 2022. https://www.kaggle.com/datasets/nenamalikah/nft-collections-by-sales-volume.


  1. Malikah, Nena. “Top NFT Collections.”

  2. CFI Team. “Adjusted R-Squared.”