Dimensionality is the number of variables, characteristics or features present in a data set. When many of these features are redundant, they hurt the training and performance of machine learning models, which is why dimensionality reduction methods are so important. The goal of dimensionality reduction is to reduce the number of dimensions in such a way that the new data remains useful. The aim of this project is to take the provided data set “crypto currency market cap” and remove the components that are not needed while keeping the data set useful. I have selected a data set containing information about various cryptocurrencies, which are blockchain-based digital coins. The data set was taken from Kaggle.
The psych package is a general purpose toolbox for personality, psychometric theory and experimental psychology. Its functions are primarily for multivariate analysis and scale construction using factor analysis, principal component analysis, cluster analysis and reliability analysis, although others provide basic descriptive statistics.
library(psych)
crypto_data <- read.csv('C:\\Users\\HP\\Desktop\\Study\\Unsupervised\\cryptoCoinMarketcap2018.csv', header = TRUE)
head(crypto_data)
## X X24h_volume_usd available_supply id last_updated market_cap_usd
## 1 0 22081300000 16785225 bitcoin 1515230661 284909000000
## 2 1 5221370000 38739144847 ripple 1515230641 119208000000
## 3 2 5705690000 96803840 ethereum 1515230649 100115000000
## 4 3 1569900000 16896225 bitcoin-cash 1515230652 44424061657
## 5 4 428305000 25927070538 cardano 1515230654 25916647856
## 6 5 2105240000 54637708 litecoin 1515230641 16574020942
## max_supply name percent_change_1h percent_change_24h
## 1 2.1e+07 Bitcoin -0.42 5.76
## 2 1.0e+11 Ripple -0.26 -9.23
## 3 NA Ethereum 0.29 -1.04
## 4 2.1e+07 Bitcoin Cash 0.03 7.99
## 5 4.5e+10 Cardano 0.39 -5.87
## 6 8.4e+07 Litecoin 2.31 22.26
## percent_change_7d price_btc price_usd rank symbol total_supply
## 1 26.04 1.00000000 1.69738e+04 1 BTC 16785225
## 2 24.15 0.00018601 3.07719e+00 2 XRP 99993093880
## 3 45.01 0.06251690 1.03421e+03 3 ETH 96803840
## 4 2.81 0.15893400 2.62923e+03 4 BCH 16896225
## 5 64.99 0.00006040 9.99598e-01 5 ADA 31112483745
## 6 32.85 0.01833680 3.03344e+02 6 LTC 54637708
summary(crypto_data)
## X X24h_volume_usd available_supply id
## Min. : 0.00 Min. :3.895e+05 Min. :6.452e+05 Length:100
## 1st Qu.:24.75 1st Qu.:1.983e+07 1st Qu.:5.479e+07 Class :character
## Median :49.50 Median :4.443e+07 Median :2.045e+08 Mode :character
## Mean :49.50 Mean :5.313e+08 Mean :4.398e+10
## 3rd Qu.:74.25 3rd Qu.:1.714e+08 3rd Qu.:1.503e+09
## Max. :99.00 Max. :2.208e+10 Max. :2.510e+12
##
## last_updated market_cap_usd max_supply name
## Min. :1.515e+09 Min. :2.995e+08 Min. :1.890e+07 Length:100
## 1st Qu.:1.515e+09 1st Qu.:4.614e+08 1st Qu.:6.863e+07 Class :character
## Median :1.515e+09 Median :7.297e+08 Median :8.880e+08 Mode :character
## Mean :1.515e+09 Mean :7.608e+09 Mean :3.111e+11
## 3rd Qu.:1.515e+09 3rd Qu.:1.980e+09 3rd Qu.:1.008e+10
## Max. :1.515e+09 Max. :2.849e+11 Max. :8.000e+12
## NA's :73
## percent_change_1h percent_change_24h percent_change_7d price_btc
## Min. :-8.8500 Min. :-20.830 Min. : -16.09 Min. :0.0000000
## 1st Qu.:-1.0075 1st Qu.: -3.085 1st Qu.: 16.17 1st Qu.:0.0000435
## Median : 0.1500 Median : 0.390 Median : 53.49 Median :0.0002462
## Mean : 0.0517 Mean : 11.169 Mean : 112.12 Mean :0.0160525
## 3rd Qu.: 1.2700 3rd Qu.: 11.662 3rd Qu.: 124.28 3rd Qu.:0.0015478
## Max. : 6.9400 Max. :210.410 Max. :2099.78 Max. :1.0000000
##
## price_usd rank symbol total_supply
## Min. : 0.000 Min. : 1.00 Length:100 Min. :1.000e+06
## 1st Qu.: 0.720 1st Qu.: 25.75 Class :character 1st Qu.:8.235e+07
## Median : 4.072 Median : 50.50 Mode :character Median :2.683e+08
## Mean : 269.864 Mean : 50.50 Mean :1.641e+11
## 3rd Qu.: 25.605 3rd Qu.: 75.25 3rd Qu.:2.039e+09
## Max. :16973.800 Max. :100.00 Max. :1.000e+13
##
Some variables are unnecessary and need to be excluded from the data set. For this project I exclude all variables of type character (id, name and symbol).
PCA_crypto <- crypto_data[,c(-4,-8,-15)]
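The column positions 4, 8 and 15 are hard-coded above. As a minimal sketch, the same exclusion could be done by column type instead. The toy data frame and its values are made up for illustration and are not the project data.

```r
# Toy data frame (illustrative values only, not the Kaggle data)
toy <- data.frame(
  id        = c("coin-a", "coin-b", "coin-c"),
  price_usd = c(1.5, 20.0, 0.3),
  symbol    = c("AAA", "BBB", "CCC"),
  rank      = c(1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Keep only the columns whose type is numeric (integer or double)
is_num <- sapply(toy, is.numeric)
toy_numeric <- toy[, is_num, drop = FALSE]

names(toy_numeric)
```

Selecting by type keeps working if the column order of the CSV file changes, which fixed positions do not.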
Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.
PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimension reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible.
# Column 6 of PCA_crypto is max_supply, which has 73 missing values, so it is left out
myPC_data <- princomp(PCA_crypto[,-6], cor = TRUE)
plot(myPC_data, main="Principal Component Analysis")
names(myPC_data)
## [1] "sdev" "loadings" "center" "scale" "n.obs" "scores" "call"
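The elements listed by names() fit together in a simple way: the scores are the standardized data projected onto the loadings. The sketch below checks this on simulated data (the matrix X and its column names are assumptions for illustration). Keeping only the first two columns of the scores is the step that actually reduces the dimension.

```r
set.seed(1)
# Simulated data: 50 observations of 4 variables, two of them correlated
X <- matrix(rnorm(200), ncol = 4,
            dimnames = list(NULL, c("a", "b", "c", "d")))
X[, "b"] <- X[, "a"] + 0.3 * X[, "b"]

pc <- princomp(X, cor = TRUE)

# Standardize with the center and scale that princomp stored,
# then change basis with the loadings (the eigenvectors)
Z <- scale(X, center = pc$center, scale = pc$scale)
manual_scores <- Z %*% unclass(pc$loadings)

# Lower-dimensional data: keep only the first two components
reduced <- manual_scores[, 1:2]
dim(reduced)

# Maximum absolute difference from princomp's own scores
max(abs(manual_scores - pc$scores))
```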
summary(myPC_data)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.0329197 1.7131514 1.3954422 1.05533629 0.86478306
## Proportion of Variance 0.3443969 0.2445740 0.1622716 0.09281122 0.06232081
## Cumulative Proportion 0.3443969 0.5889709 0.7512424 0.84405367 0.90637448
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.72772532 0.59096736 0.3902260 0.255914447 0.164018692
## Proportion of Variance 0.04413201 0.02910354 0.0126897 0.005457684 0.002241844
## Cumulative Proportion 0.95050649 0.97961003 0.9922997 0.997757408 0.999999252
## Comp.11 Comp.12
## Standard deviation 2.995235e-03 0
## Proportion of Variance 7.476193e-07 0
## Cumulative Proportion 1.000000e+00 1
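The Proportion of Variance and Cumulative Proportion rows follow directly from the Standard deviation row. As a small worked sketch, with the standard deviations copied from the output above (the variable names are illustrative), the results should agree with the printed rows up to rounding:

```r
# Standard deviations copied from the summary output above
sdev <- c(2.0329197, 1.7131514, 1.3954422, 1.05533629, 0.86478306,
          0.72772532, 0.59096736, 0.3902260, 0.255914447, 0.164018692,
          2.995235e-03, 0)

variances <- sdev^2                      # variance of each component
prop_var  <- variances / sum(variances)  # share of the total variance
cum_prop  <- cumsum(prop_var)            # running total of the shares

round(prop_var[1:3], 4)
round(cum_prop[1:3], 4)
```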
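An eigenvalue is a number that tells how much variance there is in the data set along a specific direction, which is given by the matching eigenvector. The loadings returned by princomp are the eigenvectors, and the squared standard deviations are the eigenvalues.
eigenvectors <- myPC_data$loadings
eigenvalues <- myPC_data$sdev^2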
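To see why the squared standard deviations can be treated as eigenvalues, the sketch below compares them with a direct eigendecomposition of the correlation matrix. It uses simulated data, so the matrix X is an assumption for illustration only.

```r
set.seed(2)
# Simulated data: 100 observations of 3 variables, two of them correlated
X <- matrix(rnorm(300), ncol = 3)
X[, 2] <- X[, 1] + 0.5 * X[, 2]

pc  <- princomp(X, cor = TRUE)
eig <- eigen(cor(X))

# Eigenvalues of the correlation matrix next to the squared standard deviations
cbind(eigen = eig$values, sdev_squared = unname(pc$sdev^2))

# For standardized variables the eigenvalues add up to the number of variables
sum(eig$values)
```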
Correlating the original variables with the component scores gives a matrix that shows how strongly each variable is related to each component. The closer a correlation is to 1 or -1, the more closely that variable follows the component.
round(cor(PCA_crypto[,c(-4,-8,-15)],myPC_data$scores),3)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## X 0.489 0.069 0.821 0.217 0.182 0.042 0.006 0.026
## X24h_volume_usd -0.958 -0.112 0.154 0.047 0.076 -0.012 0.017 0.140
## available_supply 0.141 -0.925 -0.002 0.050 0.066 -0.177 0.234 -0.043
## market_cap_usd -0.947 -0.101 0.105 0.035 0.084 -0.005 0.045 0.247
## max_supply NA NA NA NA NA NA NA NA
## percent_change_1h 0.039 0.055 0.245 -0.865 0.339 -0.266 -0.018 -0.008
## percent_change_7d 0.173 -0.890 0.030 0.170 -0.117 -0.298 0.128 0.014
## price_btc -0.934 -0.109 0.266 0.076 0.067 0.020 -0.028 -0.184
## price_usd -0.933 -0.110 0.267 0.076 0.067 0.020 -0.028 -0.183
## rank 0.489 0.069 0.821 0.217 0.182 0.042 0.006 0.026
## total_supply 0.102 -0.712 0.041 -0.357 0.030 0.576 0.137 0.005
## Comp.9 Comp.10 Comp.11 Comp.12
## X 0.001 0.000 0.000 -0.038
## X24h_volume_usd 0.034 -0.128 0.000 0.008
## available_supply -0.172 -0.022 0.000 0.002
## market_cap_usd -0.040 0.095 0.000 -0.009
## max_supply NA NA NA NA
## percent_change_1h 0.020 0.003 0.000 -0.005
## percent_change_7d 0.171 0.022 0.000 0.004
## price_btc 0.009 0.019 0.002 -0.010
## price_usd 0.010 0.017 -0.002 -0.006
## rank 0.001 0.000 0.000 -0.038
## total_supply 0.042 0.003 0.000 0.005
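For a PCA on the correlation matrix, the correlation between a variable and a component equals that variable's loading multiplied by the component's standard deviation. The sketch below checks this on simulated data (the matrix X is an assumption for illustration). It passes the same columns to cor() that went into princomp(). The call above re-applies c(-4,-8,-15) to PCA_crypto, which appears to drop different columns than the -6 used for the PCA and would explain the NA row for max_supply.

```r
set.seed(3)
# Simulated data: 60 observations of 4 variables, two of them correlated
X <- matrix(rnorm(240), ncol = 4,
            dimnames = list(NULL, c("v1", "v2", "v3", "v4")))
X[, "v2"] <- X[, "v1"] + 0.4 * X[, "v2"]

pc <- princomp(X, cor = TRUE)

# Correlations between the original variables and the component scores
var_comp_cor <- cor(X, pc$scores)

# Loadings scaled column-wise by the component standard deviations
scaled_loadings <- sweep(unclass(pc$loadings), 2, pc$sdev, "*")

round(var_comp_cor, 3)

# Maximum absolute difference between the two matrices
max(abs(var_comp_cor - scaled_loadings))
```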
A scree plot is a graphical tool used to select the number of relevant components or factors to be considered in a principal components analysis or a factor analysis. Under the commonly used eigenvalue-greater-than-one rule, the components to keep are those whose variance is greater than one, which is the level marked by the red dashed line in the plot.
screeplot(myPC_data,type ='l',main="Scree Plot Crypto marketcap Analysis")
abline(h = 1, col = "red", lty = 2)
plot(myPC_data$scores[,1:2], type = 'n', xlab = 'Component 1', ylab = 'Component 2')
points(myPC_data$scores[,1:2],cex = 0.5)
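The cutoff drawn on the scree plot can also be checked numerically. As a sketch, the standard deviations below are the first five copied from the summary(myPC_data) output (the remaining ones are smaller), and the variable names are illustrative. By this rule alone the fourth component, with a variance of about 1.11, sits just above the line, so the rule is best read together with the cumulative proportion of variance.

```r
# First five standard deviations from the summary output (the rest are smaller)
sdev_first5 <- c(2.0329197, 1.7131514, 1.3954422, 1.05533629, 0.86478306)

eigenvalues <- sdev_first5^2

# Indices of the components whose variance exceeds one
keep <- which(eigenvalues > 1)
keep

round(eigenvalues[keep], 3)
```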
PCA2 <- principal(PCA_crypto,nfactors =3, rotate = "none")
## Warning in cor.smooth(r): Matrix was not positive definite, smoothing was done
## Warning in principal(PCA_crypto, nfactors = 3, rotate = "none"): The matrix is
## not positive semi-definite, scores found from Structure loadings
PCA2
## Principal Components Analysis
## Call: principal(r = PCA_crypto, nfactors = 3, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 PC3 h2 u2 com
## X -0.50 -0.06 0.78 0.87 0.129 1.7
## X24h_volume_usd 0.90 0.35 0.15 0.95 0.046 1.4
## available_supply -0.35 0.89 -0.16 0.93 0.065 1.4
## last_updated 0.09 0.12 0.58 0.36 0.637 1.1
## market_cap_usd 0.89 0.33 0.11 0.92 0.082 1.3
## max_supply -0.40 0.76 0.27 0.82 0.181 1.8
## percent_change_1h -0.06 0.04 0.32 0.11 0.895 1.1
## percent_change_24h -0.29 0.72 -0.22 0.65 0.346 1.5
## percent_change_7d -0.31 0.69 -0.23 0.62 0.380 1.6
## price_btc 0.87 0.35 0.25 0.95 0.051 1.5
## price_usd 0.87 0.35 0.25 0.95 0.051 1.5
## rank -0.50 -0.06 0.78 0.87 0.129 1.7
## total_supply -0.31 0.79 0.00 0.72 0.284 1.3
##
## PC1 PC2 PC3
## SS loadings 4.21 3.48 2.03
## Proportion Var 0.32 0.27 0.16
## Cumulative Var 0.32 0.59 0.75
## Proportion Explained 0.43 0.36 0.21
## Cumulative Proportion 0.43 0.79 1.00
##
## Mean item complexity = 1.5
## Test of the hypothesis that 3 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.1
## with the empirical chi square 151.89 with prob < 2.3e-14
##
## Fit based upon off diagonal values = 0.94
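The variance rows in the principal() output can be reproduced from the SS loadings row. The sketch below is a worked check with values copied from the output above. The variable names are illustrative, and the divisor 13 is the number of variables in the loadings table.

```r
# SS loadings copied from the principal() output above
ss_loadings <- c(PC1 = 4.21, PC2 = 3.48, PC3 = 2.03)
n_vars <- 13  # number of variables (rows) in the loadings table

prop_var       <- ss_loadings / n_vars            # "Proportion Var"
cum_var        <- cumsum(prop_var)                # "Cumulative Var"
prop_explained <- ss_loadings / sum(ss_loadings)  # "Proportion Explained"

round(rbind(prop_var, cum_var, prop_explained), 2)
```

Proportion Var is measured against all 13 variables, while Proportion Explained is measured only against the three retained components, which is why the latter reaches 1.00.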
From the statistics above it can be concluded that the first three components, with a cumulative variance of 75%, are enough to describe the data set, so they can replace the original variables, and the components that explain little of the data set can be dropped.
https://www.kaggle.com/kingabzpro/crypto?select=coinmarketcap_06012018.csv
https://stackoverflow.com/questions