Introduction

Dimensionality is the number of variables, characteristics or features present in a data set. High-dimensional data often contains redundant information, which negatively affects the training and performance of Machine Learning models, and that is why dimensionality reduction methods are important. The goal of dimensionality reduction is to reduce the number of dimensions in a way that the new data remains useful. The aim of this project is to take the “crypto currency market cap” data set and remove the components that are not needed while keeping the data set useful. The data set contains information about various cryptocurrencies (blockchain-based digital coins) and was obtained from Kaggle.

Dataset Processing

The psych package is a general purpose toolbox for personality, psychometric theory and experimental psychology. Its functions are primarily for multivariate analysis and scale construction using factor analysis, principal component analysis, cluster analysis and reliability analysis, although others provide basic descriptive statistics.

library(psych)

Dataset Summary

crypto_data<- read.csv('C:\\Users\\HP\\Desktop\\Study\\Unsupervised\\cryptoCoinMarketcap2018.csv', header = T)
head(crypto_data)
##   X X24h_volume_usd available_supply           id last_updated market_cap_usd
## 1 0     22081300000         16785225      bitcoin   1515230661   284909000000
## 2 1      5221370000      38739144847       ripple   1515230641   119208000000
## 3 2      5705690000         96803840     ethereum   1515230649   100115000000
## 4 3      1569900000         16896225 bitcoin-cash   1515230652    44424061657
## 5 4       428305000      25927070538      cardano   1515230654    25916647856
## 6 5      2105240000         54637708     litecoin   1515230641    16574020942
##   max_supply         name percent_change_1h percent_change_24h
## 1    2.1e+07      Bitcoin             -0.42               5.76
## 2    1.0e+11       Ripple             -0.26              -9.23
## 3         NA     Ethereum              0.29              -1.04
## 4    2.1e+07 Bitcoin Cash              0.03               7.99
## 5    4.5e+10      Cardano              0.39              -5.87
## 6    8.4e+07     Litecoin              2.31              22.26
##   percent_change_7d  price_btc   price_usd rank symbol total_supply
## 1             26.04 1.00000000 1.69738e+04    1    BTC     16785225
## 2             24.15 0.00018601 3.07719e+00    2    XRP  99993093880
## 3             45.01 0.06251690 1.03421e+03    3    ETH     96803840
## 4              2.81 0.15893400 2.62923e+03    4    BCH     16896225
## 5             64.99 0.00006040 9.99598e-01    5    ADA  31112483745
## 6             32.85 0.01833680 3.03344e+02    6    LTC     54637708
summary(crypto_data)
##        X         X24h_volume_usd     available_supply         id           
##  Min.   : 0.00   Min.   :3.895e+05   Min.   :6.452e+05   Length:100        
##  1st Qu.:24.75   1st Qu.:1.983e+07   1st Qu.:5.479e+07   Class :character  
##  Median :49.50   Median :4.443e+07   Median :2.045e+08   Mode  :character  
##  Mean   :49.50   Mean   :5.313e+08   Mean   :4.398e+10                     
##  3rd Qu.:74.25   3rd Qu.:1.714e+08   3rd Qu.:1.503e+09                     
##  Max.   :99.00   Max.   :2.208e+10   Max.   :2.510e+12                     
##                                                                            
##   last_updated       market_cap_usd        max_supply            name          
##  Min.   :1.515e+09   Min.   :2.995e+08   Min.   :1.890e+07   Length:100        
##  1st Qu.:1.515e+09   1st Qu.:4.614e+08   1st Qu.:6.863e+07   Class :character  
##  Median :1.515e+09   Median :7.297e+08   Median :8.880e+08   Mode  :character  
##  Mean   :1.515e+09   Mean   :7.608e+09   Mean   :3.111e+11                     
##  3rd Qu.:1.515e+09   3rd Qu.:1.980e+09   3rd Qu.:1.008e+10                     
##  Max.   :1.515e+09   Max.   :2.849e+11   Max.   :8.000e+12                     
##                                          NA's   :73                            
##  percent_change_1h percent_change_24h percent_change_7d   price_btc        
##  Min.   :-8.8500   Min.   :-20.830    Min.   : -16.09   Min.   :0.0000000  
##  1st Qu.:-1.0075   1st Qu.: -3.085    1st Qu.:  16.17   1st Qu.:0.0000435  
##  Median : 0.1500   Median :  0.390    Median :  53.49   Median :0.0002462  
##  Mean   : 0.0517   Mean   : 11.169    Mean   : 112.12   Mean   :0.0160525  
##  3rd Qu.: 1.2700   3rd Qu.: 11.662    3rd Qu.: 124.28   3rd Qu.:0.0015478  
##  Max.   : 6.9400   Max.   :210.410    Max.   :2099.78   Max.   :1.0000000  
##                                                                            
##    price_usd              rank           symbol           total_supply      
##  Min.   :    0.000   Min.   :  1.00   Length:100         Min.   :1.000e+06  
##  1st Qu.:    0.720   1st Qu.: 25.75   Class :character   1st Qu.:8.235e+07  
##  Median :    4.072   Median : 50.50   Mode  :character   Median :2.683e+08  
##  Mean   :  269.864   Mean   : 50.50                      Mean   :1.641e+11  
##  3rd Qu.:   25.605   3rd Qu.: 75.25                      3rd Qu.:2.039e+09  
##  Max.   :16973.800   Max.   :100.00                      Max.   :1.000e+13  
## 

Excluding variables

Some variables are unnecessary and need to be excluded from the data set. For this project I exclude all variables of type character (id, name and symbol, in columns 4, 8 and 15).

PCA_crypto <- crypto_data[,c(-4,-8,-15)]
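
As an alternative to hard-coding column positions, the character columns can also be dropped by type. The following is a minimal sketch, assuming a small hypothetical data frame `toy` that stands in for crypto_data; the names `toy`, `numeric_cols` and `toy_numeric` are illustrative and not part of the original analysis.

```r
# Hypothetical stand-in for crypto_data (only a few columns and rows)
toy <- data.frame(
  id        = c("bitcoin", "ripple", "ethereum"),
  rank      = c(1, 2, 3),
  symbol    = c("BTC", "XRP", "ETH"),
  price_btc = c(1, 0.00018601, 0.0625169),
  stringsAsFactors = FALSE
)

# Flag the numeric columns, whatever their positions are
numeric_cols <- sapply(toy, is.numeric)

# Keep only the flagged columns; drop = FALSE keeps a data frame
toy_numeric <- toy[, numeric_cols, drop = FALSE]

names(toy_numeric)
```

Selecting by type keeps working if the column order of the input file changes, whereas negative positions such as -4, -8 and -15 silently point at different columns.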

Applying Principal Component Analysis

Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.

PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimension reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible.

myPC_data <- princomp(PCA_crypto[,-6],cor = TRUE)

plot(myPC_data ,  main="Principal Component Analysis")
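
The "change of basis" described above can be made concrete with a small sketch. It assumes the built-in USArrests data set as a stand-in (it is not the crypto data), and the names `z`, `rotation`, `scores` and `reduced` are illustrative. The idea is to standardise the data with the centre and scale that princomp stored, then multiply by the loadings.

```r
# USArrests ships with R (datasets package); used here only as a stand-in
x  <- USArrests
pc <- princomp(x, cor = TRUE)

# Standardise with the centre and scale stored by princomp
z <- scale(x, center = pc$center, scale = pc$scale)

# Loadings as a plain matrix: one eigenvector per column
rotation <- unclass(pc$loadings)

# Change of basis: project the standardised data onto the components
scores <- z %*% rotation

# These agree with the scores computed by princomp itself
all.equal(scores, pc$scores, check.attributes = FALSE)

# Lower-dimensional data: keep only the first two components
reduced <- scores[, 1:2]
dim(reduced)
```

Keeping only the leading columns of the score matrix is what "projecting onto the first few principal components" amounts to in practice.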

Summary of components for reduced dataset

names(myPC_data)
## [1] "sdev"     "loadings" "center"   "scale"    "n.obs"    "scores"   "call"
summary(myPC_data)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.0329197 1.7131514 1.3954422 1.05533629 0.86478306
## Proportion of Variance 0.3443969 0.2445740 0.1622716 0.09281122 0.06232081
## Cumulative Proportion  0.3443969 0.5889709 0.7512424 0.84405367 0.90637448
##                            Comp.6     Comp.7    Comp.8      Comp.9     Comp.10
## Standard deviation     0.72772532 0.59096736 0.3902260 0.255914447 0.164018692
## Proportion of Variance 0.04413201 0.02910354 0.0126897 0.005457684 0.002241844
## Cumulative Proportion  0.95050649 0.97961003 0.9922997 0.997757408 0.999999252
##                             Comp.11 Comp.12
## Standard deviation     2.995235e-03       0
## Proportion of Variance 7.476193e-07       0
## Cumulative Proportion  1.000000e+00       1
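
The two proportion rows in this summary follow directly from the standard deviations. As a rough check, the sketch below copies the printed standard deviations by hand (so the values are rounded) and recomputes both rows; the variable names are illustrative.

```r
# Standard deviations copied from the summary(myPC_data) output
sdev <- c(2.0329197, 1.7131514, 1.3954422, 1.05533629, 0.86478306,
          0.72772532, 0.59096736, 0.3902260, 0.255914447, 0.164018692,
          2.995235e-03, 0)

variance   <- sdev^2                     # variance of each component
proportion <- variance / sum(variance)   # "Proportion of Variance" row
cumulative <- cumsum(proportion)         # "Cumulative Proportion" row

round(proportion[1:3], 4)
round(cumulative[1:3], 4)
```

Because the analysis is based on the correlation matrix of 12 variables, the variances add up to approximately 12.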

Applying Eigen values

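An eigenvalue is a number that tells how much variance there is in the data set in a specific direction. For princomp, the eigenvalues are the squares of the component standard deviations, and the loadings are the corresponding eigenvectors.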

eigenvectors <- myPC_data$loadings
eigenvalues <- myPC_data$sdev * myPC_data$sdev
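
To see why squaring sdev gives eigenvalues, the squared standard deviations can be compared with a direct eigen decomposition of the correlation matrix. This is a minimal sketch on the built-in USArrests data set, used only as a stand-in for the crypto data; `ev` is an illustrative name.

```r
x  <- USArrests                 # built-in stand-in data set
pc <- princomp(x, cor = TRUE)

# Eigen decomposition of the correlation matrix
ev <- eigen(cor(x), symmetric = TRUE)

# Component variances equal the eigenvalues of the correlation matrix
all.equal(unname(pc$sdev^2), ev$values)

# With standardised variables the eigenvalues sum to the number of variables
sum(ev$values)
```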

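Correlating the original variables with the component scores gives a matrix that shows how strongly each variable is related to each component. The closer a value is to 1 or -1, the more that variable contributes to the component. The max_supply row is NA because that column contains missing values; the column index used below also leaves last_updated and percent_change_24h out of the table.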

round(cor(PCA_crypto[,c(-4,-8,-15)],myPC_data$scores),3)
##                   Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## X                  0.489  0.069  0.821  0.217  0.182  0.042  0.006  0.026
## X24h_volume_usd   -0.958 -0.112  0.154  0.047  0.076 -0.012  0.017  0.140
## available_supply   0.141 -0.925 -0.002  0.050  0.066 -0.177  0.234 -0.043
## market_cap_usd    -0.947 -0.101  0.105  0.035  0.084 -0.005  0.045  0.247
## max_supply            NA     NA     NA     NA     NA     NA     NA     NA
## percent_change_1h  0.039  0.055  0.245 -0.865  0.339 -0.266 -0.018 -0.008
## percent_change_7d  0.173 -0.890  0.030  0.170 -0.117 -0.298  0.128  0.014
## price_btc         -0.934 -0.109  0.266  0.076  0.067  0.020 -0.028 -0.184
## price_usd         -0.933 -0.110  0.267  0.076  0.067  0.020 -0.028 -0.183
## rank               0.489  0.069  0.821  0.217  0.182  0.042  0.006  0.026
## total_supply       0.102 -0.712  0.041 -0.357  0.030  0.576  0.137  0.005
##                   Comp.9 Comp.10 Comp.11 Comp.12
## X                  0.001   0.000   0.000  -0.038
## X24h_volume_usd    0.034  -0.128   0.000   0.008
## available_supply  -0.172  -0.022   0.000   0.002
## market_cap_usd    -0.040   0.095   0.000  -0.009
## max_supply            NA      NA      NA      NA
## percent_change_1h  0.020   0.003   0.000  -0.005
## percent_change_7d  0.171   0.022   0.000   0.004
## price_btc          0.009   0.019   0.002  -0.010
## price_usd          0.010   0.017  -0.002  -0.006
## rank               0.001   0.000   0.000  -0.038
## total_supply       0.042   0.003   0.000   0.005
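
For a correlation-based PCA, the same kind of table can also be obtained without the raw data, by scaling each column of the loadings by its component's standard deviation. The sketch below illustrates this on the built-in USArrests data set (a stand-in, not the crypto data); `r_direct` and `r_loadings` are illustrative names.

```r
x  <- USArrests
pc <- princomp(x, cor = TRUE)

# Direct route: correlate each original variable with each score column
r_direct <- cor(x, pc$scores)

# Shortcut: multiply each eigenvector (column) by its component's sdev
r_loadings <- sweep(unclass(pc$loadings), 2, pc$sdev, `*`)

# Both routes give the same matrix
all.equal(r_direct, r_loadings, check.attributes = FALSE)

round(r_direct, 3)
```

The shortcut has no missing-value problem, because it only uses the fitted model and never touches columns that were left out of the PCA.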

Scree Plot

A scree plot is a graphical tool used in the selection of the number of relevant components or factors to be considered in a principal components analysis or a factor analysis. The red dashed line marks a variance of one; a common rule of thumb (the Kaiser criterion) is to retain the components whose variance is greater than one.

screeplot(myPC_data,type ='l',main="Scree Plot Crypto marketcap Analysis")
abline(1,0,col = "red",lty =2)
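
One common reading of the red reference line is the Kaiser rule: keep the components whose variance exceeds 1. The sketch below applies that rule to the standard deviations printed in the component summary, copied by hand; it is only a heuristic, and the names `eigenvalues` and `keep` are illustrative.

```r
# Standard deviations copied from the summary(myPC_data) output
sdev <- c(2.0329197, 1.7131514, 1.3954422, 1.05533629, 0.86478306,
          0.72772532, 0.59096736, 0.3902260, 0.255914447, 0.164018692,
          2.995235e-03, 0)

eigenvalues <- sdev^2

# Indices of the components whose variance lies above the red line
keep <- which(eigenvalues > 1)

keep
round(eigenvalues[keep], 3)
```

The fourth component clears the threshold only narrowly, so the rule is best read together with the shape of the scree plot and the cumulative proportion of variance.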

The plot below shows the scores of the observations on the first component (x-axis) and the second component (y-axis).

plot(myPC_data$scores[,1:2],type = 'n',xlab='ComponentA',ylab = 'Component B')
points(myPC_data$scores[,1:2],cex = 0.5)

Principal analysis:

PCA2 <- principal(PCA_crypto,nfactors =3, rotate = "none")
## Warning in cor.smooth(r): Matrix was not positive definite, smoothing was done
## Warning in principal(PCA_crypto, nfactors = 3, rotate = "none"): The matrix is
## not positive semi-definite, scores found from Structure loadings
PCA2
## Principal Components Analysis
## Call: principal(r = PCA_crypto, nfactors = 3, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                      PC1   PC2   PC3   h2    u2 com
## X                  -0.50 -0.06  0.78 0.87 0.129 1.7
## X24h_volume_usd     0.90  0.35  0.15 0.95 0.046 1.4
## available_supply   -0.35  0.89 -0.16 0.93 0.065 1.4
## last_updated        0.09  0.12  0.58 0.36 0.637 1.1
## market_cap_usd      0.89  0.33  0.11 0.92 0.082 1.3
## max_supply         -0.40  0.76  0.27 0.82 0.181 1.8
## percent_change_1h  -0.06  0.04  0.32 0.11 0.895 1.1
## percent_change_24h -0.29  0.72 -0.22 0.65 0.346 1.5
## percent_change_7d  -0.31  0.69 -0.23 0.62 0.380 1.6
## price_btc           0.87  0.35  0.25 0.95 0.051 1.5
## price_usd           0.87  0.35  0.25 0.95 0.051 1.5
## rank               -0.50 -0.06  0.78 0.87 0.129 1.7
## total_supply       -0.31  0.79  0.00 0.72 0.284 1.3
## 
##                        PC1  PC2  PC3
## SS loadings           4.21 3.48 2.03
## Proportion Var        0.32 0.27 0.16
## Cumulative Var        0.32 0.59 0.75
## Proportion Explained  0.43 0.36 0.21
## Cumulative Proportion 0.43 0.79 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.1 
##  with the empirical chi square  151.89  with prob <  2.3e-14 
## 
## Fit based upon off diagonal values = 0.94
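
The variance rows of this output can be reproduced from the "SS loadings" row alone. The sketch below copies the three printed values by hand (so they are rounded) and assumes 13 variables, the number of rows in the loadings table; the variable names are illustrative.

```r
# "SS loadings" row copied from the PCA2 output
ss_loadings <- c(PC1 = 4.21, PC2 = 3.48, PC3 = 2.03)
n_variables <- 13   # number of rows in the loadings table

# Share of the total variance (one unit per standardised variable)
proportion_var <- ss_loadings / n_variables
cumulative_var <- cumsum(proportion_var)

# Share of the variance captured by the three retained components only
proportion_explained <- ss_loadings / sum(ss_loadings)

round(rbind(proportion_var, cumulative_var, proportion_explained), 2)
```

"Proportion Var" is relative to all 13 variables, while "Proportion Explained" is relative to the three retained components, which is why the latter sums to 1.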

Conclusion

From the statistics above it can be concluded that the first three components together explain about 75% of the total variance. The data set can therefore be represented by these three components in place of the original variables, and the remaining components, which contribute little to explaining the data set, can be dropped.

https://www.kaggle.com/kingabzpro/crypto?select=coinmarketcap_06012018.csv
https://stackoverflow.com/questions