In this short survey we are going to have a dimension reduction on Bank Marketing data.The data is related to the campaigns which was hold by one of Portuguese banking institution which was mainly based on the phone calls made, often contacting a client more than one time. The aim is to use PCA and MDS together with the plots. Our dataset was acquired from the http://archive.ics.uci.edu/ml/index.php. The dataset and attribute information can be found at http://archive.ics.uci.edu/ml/datasets/Bank+Marketing.
Following libraries should be downloaded before start:
#library(factoextra)
#library(labdsv)
#library(vegan)
#library(ape)
#library(ggfortify)
#library(factoextra)
#library(pca3d)
#library(pls)
#library(ClusterR)
#library(labdsv)
#library(devtools)
#library(ggplot2)
#library(readstata13)
library(FactoMineR)
library(MASS)
library(smacof)
## Loading required package: plotrix
Sometimes while dealing with data you can face with the problem of having a high dimensionality data (where you have more variables then observations) which most probably has a multicollinearity problem. To transform the dataset into a more reliable subset one has to decrease the dimension of the data. For this you can make use of the Principal component Analysis (PCA) and Multidimensional Scaling (MDS). Firstly we are going to go through the PCA. Here we generally decrease. It generally consists of 5 steps:
## age balance day duration campaign
## 1 30 1787 19 79 1
## 2 33 4789 11 220 1
## 3 35 1350 16 185 1
## 4 30 1476 3 199 4
## 5 59 0 5 226 1
## 6 35 747 23 141 2
## 'data.frame': 555 obs. of 5 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ duration: int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign: int 1 1 1 4 1 2 1 2 2 1 ...
## age balance day duration
## Min. :19.00 Min. :-1206 Min. : 1.00 Min. : 5.0
## 1st Qu.:33.00 1st Qu.: 102 1st Qu.: 9.00 1st Qu.: 109.0
## Median :40.00 Median : 480 Median :16.00 Median : 194.0
## Mean :41.29 Mean : 1465 Mean :15.91 Mean : 271.3
## 3rd Qu.:49.00 3rd Qu.: 1592 3rd Qu.:21.00 3rd Qu.: 336.5
## Max. :78.00 Max. :16873 Max. :31.00 Max. :1877.0
## campaign
## Min. : 1.000
## 1st Qu.: 1.000
## Median : 2.000
## Mean : 2.595
## 3rd Qu.: 3.000
## Max. :32.000
Adjustment of prcomp() function prcomp() - Performs a PCA on the given data matrix and returns the results as an object. Results are rotation, loadings and the standart deviation
bank.pca <- prcomp(scale(bank1), center = TRUE, scale = TRUE)
names(bank.pca)
## [1] "sdev" "rotation" "center" "scale" "x"
print(bank.pca)
## Standard deviations (1, .., p=5):
## [1] 1.1188712 1.0731079 0.9870105 0.9219533 0.8788511
##
## Rotation (n x k) = (5 x 5):
## PC1 PC2 PC3 PC4 PC5
## age 0.4375835 -0.5145330 0.06308829 -0.6678371 -0.3062513800
## balance 0.3833436 -0.4230569 0.58534721 0.5094708 0.2681021012
## day -0.5679005 -0.4083800 0.15251872 0.2549888 -0.6499489335
## duration -0.2483390 0.3964484 0.79159116 -0.3931152 -0.0005804307
## campaign -0.5266737 -0.4820095 -0.05924517 -0.2736316 0.6417923360
summary(bank.pca)
## Importance of components%s:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.1189 1.0731 0.9870 0.9220 0.8789
## Proportion of Variance 0.2504 0.2303 0.1948 0.1700 0.1545
## Cumulative Proportion 0.2504 0.4807 0.6755 0.8455 1.0000
Here we have 5 varibles, principal components, where you can choose (specific number of PCs) with the highest variances. Our aim is to specify the level of variance covered by the individual principal component. As we can see from below about 3 PCs explain almost 68% of variance. Another way to notice this in a more clear way is to plot these changes. To facilitate our plotting procedure here we have plotsPca() function which directly returns to us 4 grraphs.
plotsPca <- function(d) {
d.vr <- d$sdev ^ 2
d.pvr <- d.vr/sum(d.vr)
print("proportions of variance:")
print(d.vr)
par(mfrow=c(2,2))
plot(d.pvr,xlab="PC", ylab="Variance explained", ylim=c(0,1), type='b')
plot(cumsum(d.pvr),xlab="PC", ylab="Variance explained (cumulative)", ylim=c(0,1), type='b')
screeplot(d)
screeplot(d,type="l")
par(mfrow=c(1,1))
}
plotsPca(bank.pca)
## [1] "proportions of variance:"
## [1] 1.2518727 1.1515606 0.9741897 0.8499978 0.7723792
pca.plot1 <- princomp(bank1)
plot(pca.plot1)
pca.plot2 <- PCA(bank1, scale.unit = TRUE)
plot(pca.plot2)
summary(pca.plot2)
##
## Call:
## PCA(X = bank1, scale.unit = TRUE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Variance 1.252 1.152 0.974 0.850 0.772
## % of var. 25.037 23.031 19.484 17.000 15.448
## Cumulative % of var. 25.037 48.069 67.552 84.552 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 1.450 | 0.173 0.004 0.014 | -0.321 0.016 0.049 | -0.462
## 2 | 1.764 | -0.846 0.103 0.230 | -0.270 0.011 0.023 | 0.526
## 3 | 0.867 | -0.079 0.001 0.008 | -0.457 0.033 0.278 | -0.283
## 4 | 1.993 | -0.251 0.009 0.016 | -0.862 0.116 0.187 | -0.546
## 5 | 2.301 | -1.598 0.367 0.482 | -0.130 0.003 0.003 | -0.545
## 6 | 1.213 | 0.642 0.059 0.280 | 0.020 0.000 0.000 | -0.442
## 7 | 0.940 | 0.043 0.000 0.002 | -0.915 0.131 0.946 | -0.106
## 8 | 1.434 | -0.611 0.054 0.181 | -0.757 0.090 0.279 | -0.850
## 9 | 0.985 | -0.232 0.008 0.055 | -0.107 0.002 0.012 | -0.944
## 10 | 0.868 | -0.002 0.000 0.000 | -0.449 0.032 0.268 | -0.183
## ctr cos2
## 1 0.039 0.101 |
## 2 0.051 0.089 |
## 3 0.015 0.106 |
## 4 0.055 0.075 |
## 5 0.055 0.056 |
## 6 0.036 0.133 |
## 7 0.002 0.013 |
## 8 0.134 0.351 |
## 9 0.165 0.917 |
## 10 0.006 0.044 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## age | -0.490 19.148 0.240 | 0.552 26.474 0.305 | 0.062 0.398
## balance | -0.429 14.695 0.184 | 0.454 17.898 0.206 | 0.578 34.263
## day | 0.635 32.251 0.404 | 0.438 16.677 0.192 | 0.151 2.326
## duration | 0.278 6.167 0.077 | -0.425 15.717 0.181 | 0.781 62.662
## campaign | 0.589 27.739 0.347 | 0.517 23.233 0.268 | -0.058 0.351
## cos2
## age 0.004 |
## balance 0.334 |
## day 0.023 |
## duration 0.610 |
## campaign 0.003 |
biplot(bank.pca, scale = 0, cex = 0.7)
rot.pca <- bank.pca
rot.pca$rotation <- -rot.pca$rotation
rot.pca$x <- -rot.pca$x
biplot(rot.pca,scale=0, cex=.7)
rot.pca$rotation[,1:4]
## PC1 PC2 PC3 PC4
## age -0.4375835 0.5145330 -0.06308829 0.6678371
## balance -0.3833436 0.4230569 -0.58534721 -0.5094708
## day 0.5679005 0.4083800 -0.15251872 -0.2549888
## duration 0.2483390 -0.3964484 -0.79159116 0.3931152
## campaign 0.5266737 0.4820095 0.05924517 0.2736316
Usually for SVD we know that: X is a matrix of n rows and p colums (nxp matrix) where Xnxp = Unxn * Dnxp * Vpxp Above we used general princomp() function, but svd itself gives us more information and control
df <- nrow(bank1) - 1
pca.svd <- scale(bank1)
pca.svd1 <- svd(pca.svd)
pca.svd1$d/sqrt(df)
## [1] 1.1188712 1.0731079 0.9870105 0.9219533 0.8788511
pca.svd1$v
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.4375835 -0.5145330 0.06308829 -0.6678371 -0.3062513800
## [2,] 0.3833436 -0.4230569 0.58534721 0.5094708 0.2681021012
## [3,] -0.5679005 -0.4083800 0.15251872 0.2549888 -0.6499489335
## [4,] -0.2483390 0.3964484 0.79159116 -0.3931152 -0.0005804307
## [5,] -0.5266737 -0.4820095 -0.05924517 -0.2736316 0.6417923360
Multidimensional scaling is mostly similar to PCA except that instead of converting correlations it simply takes into account the distances. Simply putting it converts distances among samples, simply putting each single object in N-dimensional space s.t. the distances between objects are maintained.
dist.reg <- dist(bank1)
as.matrix(dist.reg)[1:10, 1:10]
## 1 2 3 4 5 6 7
## 1 0.0000 3005.322 449.7099 333.7454 1793.3251 1041.8666 1503.0319
## 2 3005.3216 0.000 3439.1823 3313.0789 4789.0781 4042.7904 4483.6350
## 3 449.7099 3439.182 0.0000 127.5735 1350.8805 604.6445 1054.6042
## 4 333.7454 3313.079 127.5735 0.0000 1476.5361 731.5969 1177.6634
## 5 1793.3251 4789.078 1350.8805 1476.5361 0.0000 752.4194 328.7613
## 6 1041.8666 4042.790 604.6445 731.5969 752.4194 0.0000 483.4077
## 7 1503.0319 4483.635 1054.6042 1177.6634 328.7613 483.4077 0.0000
## 8 1641.6562 4642.519 1203.5290 1329.9019 166.2408 600.3374 248.5438
## 9 1566.2015 4570.915 1136.2509 1263.1053 278.9409 532.7748 296.7794
## 10 1889.5910 4877.901 1443.7081 1568.2685 125.3515 852.5902 396.0644
## 8 9 10
## 1 1641.6562 1566.2015 1889.5910
## 2 4642.5195 4570.9153 4877.9006
## 3 1203.5290 1136.2509 1443.7081
## 4 1329.9019 1263.1053 1568.2685
## 5 166.2408 278.9409 125.3515
## 6 600.3374 532.7748 852.5902
## 7 248.5438 296.7794 396.0644
## 8 0.0000 119.9166 285.6694
## 9 119.9166 0.0000 401.2867
## 10 285.6694 401.2867 0.0000
bank.mds <- cmdscale(dist.reg, 2)
summary(bank.mds)
## V1 V2
## Min. :-15408.1 Min. :-271.44
## 1st Qu.: -126.1 1st Qu.:-164.34
## Median : 985.8 Median : -78.45
## Mean : 0.0 Mean : 0.00
## 3rd Qu.: 1362.7 3rd Qu.: 65.16
## Max. : 2671.3 Max. :1602.19
plot(bank.mds)
plot(bank.mds, ylim=c(-1e+05, 1e+05))
To make MDS comparable to PCA here we use rescaled Euclidean distance where classical Torgerson Gower is double centered. Here we also check the goodness of fit which is not that substantial 48% explained by the first two dimensions.
varbank <- apply(bank1,2,var)
bank.adapted <- sweep(bank1,2,sqrt(varbank),"/")
bank.gower <- cmdscale(dist(bank.adapted),k=2,eig=T)
bank.gower$GOF
## [1] 0.4806867 0.4806867
Next, we have a 2-dimensinal solution by minimizing the loss function
eqscplot(bank.gower$points,type="n")
text(bank.gower$point,row.names(bank1),cex=.8)
Additionally we can measure the dissimilarity using the dimension reduction. Below to measure the similarity we use the correlation coefficient.
bank.vs <- cmdscale(1/cor(bank1),k=2,eig=T)
eqscplot(bank.vs$points,type="n")
text(bank.vs$point,row.names(cor(bank1)),cex=.8)
bank.dis <- dist(bank1)
bank.fit <- mds(bank.dis, ndim=2)
bank.fit
##
## Call:
## mds(delta = bank.dis, ndim = 2)
##
## Model: Symmetric SMACOF
## Number of objects: 555
## Stress-1 value: 0
## Number of iterations: 1
#summary(bank.fit)
plot(bank.fit)
It is a statistical method which weighs up a group of shapes and transforms them into a state of maximal placement of images.
x1<-bank1[,c(1:2)]
x2<-bank1[,c(3:4)]
dis.x1<-dist(x1)
dis.x2<-dist(x2)
pr.x1<-mds(dis.x1, ndim=2)
pr.x2<-mds(dis.x2, ndim=2)
plot(pr.x1)
plot(pr.x2)
Getting the MDS coordinates and plotting Procrustes Configuration Plot
plot(pr.x1$conf[,1:2])
pr1 <- Procrustes(pr.x1$conf, pr.x2$conf)
pr1
##
## Call: Procrustes(X = pr.x1$conf, Y = pr.x2$conf)
##
## Congruence coefficient: 0.372
##
## Rotation matrix:
## D1 D2
## D1 1.000 -0.003
## D2 0.003 1.000
##
## Translation vector: 0 0
## Dilation factor: 0.031
plot(pr1)
Below we show that the MDS can also be achieved using svd() function as it was done for the PCA
bank.mds1 <- bank[1:555, c(1,6,10,12:15)]
bank.mds11 <- as.matrix(sweep(bank.mds1, 2, colMeans(bank.mds1)))
ss <- svd(bank.mds11 %*% t(bank.mds11))
ss1 <- ss$v %*% sqrt(diag(ss$d))
a1 <- ss1[, 1]
b1 <- ss1[, 2]
plot(a1, b1)