Step 1: Loading data

In this short survey we are going to have a dimension reduction on Bank Marketing data.The data is related to the campaigns which was hold by one of Portuguese banking institution which was mainly based on the phone calls made, often contacting a client more than one time. The aim is to use PCA and MDS together with the plots. Our dataset was acquired from the http://archive.ics.uci.edu/ml/index.php. The dataset and attribute information can be found at http://archive.ics.uci.edu/ml/datasets/Bank+Marketing.

Following libraries should be downloaded before start:

#library(factoextra)
#library(labdsv)
#library(vegan)
#library(ape)
#library(ggfortify)
#library(factoextra)
#library(pca3d)
#library(pls)
#library(ClusterR)
#library(labdsv)
#library(devtools)
#library(ggplot2)
#library(readstata13)

library(FactoMineR)
library(MASS)
library(smacof)
## Loading required package: plotrix

Step 1: Arrangement of the dataset

Sometimes while dealing with data you can face with the problem of having a high dimensionality data (where you have more variables then observations) which most probably has a multicollinearity problem. To transform the dataset into a more reliable subset one has to decrease the dimension of the data. For this you can make use of the Principal component Analysis (PCA) and Multidimensional Scaling (MDS). Firstly we are going to go through the PCA. Here we generally decrease. It generally consists of 5 steps:

  1. Normalization: computing mean for each parameter and then subtract the mean from data
  2. Feature scaling
  3. Getting covariance matrix
  4. Diagonalize covariance matrix using SVD (by finding eigenvectors and eigenvalues)
  5. Reconstruction - approximation of original data (compressing data points)
##   age balance day duration campaign
## 1  30    1787  19       79        1
## 2  33    4789  11      220        1
## 3  35    1350  16      185        1
## 4  30    1476   3      199        4
## 5  59       0   5      226        1
## 6  35     747  23      141        2
## 'data.frame':    555 obs. of  5 variables:
##  $ age     : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ balance : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ day     : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ duration: int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign: int  1 1 1 4 1 2 1 2 2 1 ...
##       age           balance           day           duration     
##  Min.   :19.00   Min.   :-1206   Min.   : 1.00   Min.   :   5.0  
##  1st Qu.:33.00   1st Qu.:  102   1st Qu.: 9.00   1st Qu.: 109.0  
##  Median :40.00   Median :  480   Median :16.00   Median : 194.0  
##  Mean   :41.29   Mean   : 1465   Mean   :15.91   Mean   : 271.3  
##  3rd Qu.:49.00   3rd Qu.: 1592   3rd Qu.:21.00   3rd Qu.: 336.5  
##  Max.   :78.00   Max.   :16873   Max.   :31.00   Max.   :1877.0  
##     campaign     
##  Min.   : 1.000  
##  1st Qu.: 1.000  
##  Median : 2.000  
##  Mean   : 2.595  
##  3rd Qu.: 3.000  
##  Max.   :32.000

Step 2: Stardardization process via the application of scaling

Adjustment of prcomp() function prcomp() - Performs a PCA on the given data matrix and returns the results as an object. Results are rotation, loadings and the standart deviation

bank.pca <- prcomp(scale(bank1), center = TRUE, scale = TRUE)
names(bank.pca)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
print(bank.pca)
## Standard deviations (1, .., p=5):
## [1] 1.1188712 1.0731079 0.9870105 0.9219533 0.8788511
## 
## Rotation (n x k) = (5 x 5):
##                 PC1        PC2         PC3        PC4           PC5
## age       0.4375835 -0.5145330  0.06308829 -0.6678371 -0.3062513800
## balance   0.3833436 -0.4230569  0.58534721  0.5094708  0.2681021012
## day      -0.5679005 -0.4083800  0.15251872  0.2549888 -0.6499489335
## duration -0.2483390  0.3964484  0.79159116 -0.3931152 -0.0005804307
## campaign -0.5266737 -0.4820095 -0.05924517 -0.2736316  0.6417923360

Step 3: Visualizing eigenvalues and choosing varibles with highest variances

summary(bank.pca)
## Importance of components%s:
##                           PC1    PC2    PC3    PC4    PC5
## Standard deviation     1.1189 1.0731 0.9870 0.9220 0.8789
## Proportion of Variance 0.2504 0.2303 0.1948 0.1700 0.1545
## Cumulative Proportion  0.2504 0.4807 0.6755 0.8455 1.0000

Here we have 5 varibles, principal components, where you can choose (specific number of PCs) with the highest variances. Our aim is to specify the level of variance covered by the individual principal component. As we can see from below about 3 PCs explain almost 68% of variance. Another way to notice this in a more clear way is to plot these changes. To facilitate our plotting procedure here we have plotsPca() function which directly returns to us 4 grraphs.

plotsPca <- function(d) {
  d.vr <- d$sdev ^ 2
  d.pvr <- d.vr/sum(d.vr)
  print("proportions of variance:")
  print(d.vr)
  
  par(mfrow=c(2,2))
  plot(d.pvr,xlab="PC", ylab="Variance explained", ylim=c(0,1), type='b')
  plot(cumsum(d.pvr),xlab="PC", ylab="Variance explained (cumulative)", ylim=c(0,1), type='b')
  screeplot(d)
  screeplot(d,type="l")
  par(mfrow=c(1,1))
}

plotsPca(bank.pca)
## [1] "proportions of variance:"
## [1] 1.2518727 1.1515606 0.9741897 0.8499978 0.7723792

Step 4: Additional plots of PCA using princomp() function

pca.plot1 <- princomp(bank1)  
plot(pca.plot1)

pca.plot2 <- PCA(bank1, scale.unit = TRUE)

plot(pca.plot2)

summary(pca.plot2)
## 
## Call:
## PCA(X = bank1, scale.unit = TRUE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5
## Variance               1.252   1.152   0.974   0.850   0.772
## % of var.             25.037  23.031  19.484  17.000  15.448
## Cumulative % of var.  25.037  48.069  67.552  84.552 100.000
## 
## Individuals (the 10 first)
##              Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1        |  1.450 |  0.173  0.004  0.014 | -0.321  0.016  0.049 | -0.462
## 2        |  1.764 | -0.846  0.103  0.230 | -0.270  0.011  0.023 |  0.526
## 3        |  0.867 | -0.079  0.001  0.008 | -0.457  0.033  0.278 | -0.283
## 4        |  1.993 | -0.251  0.009  0.016 | -0.862  0.116  0.187 | -0.546
## 5        |  2.301 | -1.598  0.367  0.482 | -0.130  0.003  0.003 | -0.545
## 6        |  1.213 |  0.642  0.059  0.280 |  0.020  0.000  0.000 | -0.442
## 7        |  0.940 |  0.043  0.000  0.002 | -0.915  0.131  0.946 | -0.106
## 8        |  1.434 | -0.611  0.054  0.181 | -0.757  0.090  0.279 | -0.850
## 9        |  0.985 | -0.232  0.008  0.055 | -0.107  0.002  0.012 | -0.944
## 10       |  0.868 | -0.002  0.000  0.000 | -0.449  0.032  0.268 | -0.183
##             ctr   cos2  
## 1         0.039  0.101 |
## 2         0.051  0.089 |
## 3         0.015  0.106 |
## 4         0.055  0.075 |
## 5         0.055  0.056 |
## 6         0.036  0.133 |
## 7         0.002  0.013 |
## 8         0.134  0.351 |
## 9         0.165  0.917 |
## 10        0.006  0.044 |
## 
## Variables
##             Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## age      | -0.490 19.148  0.240 |  0.552 26.474  0.305 |  0.062  0.398
## balance  | -0.429 14.695  0.184 |  0.454 17.898  0.206 |  0.578 34.263
## day      |  0.635 32.251  0.404 |  0.438 16.677  0.192 |  0.151  2.326
## duration |  0.278  6.167  0.077 | -0.425 15.717  0.181 |  0.781 62.662
## campaign |  0.589 27.739  0.347 |  0.517 23.233  0.268 | -0.058  0.351
##            cos2  
## age       0.004 |
## balance   0.334 |
## day       0.023 |
## duration  0.610 |
## campaign  0.003 |
biplot(bank.pca, scale = 0, cex = 0.7)

rot.pca <- bank.pca
rot.pca$rotation <- -rot.pca$rotation
rot.pca$x <- -rot.pca$x
biplot(rot.pca,scale=0, cex=.7)

rot.pca$rotation[,1:4]
##                 PC1        PC2         PC3        PC4
## age      -0.4375835  0.5145330 -0.06308829  0.6678371
## balance  -0.3833436  0.4230569 -0.58534721 -0.5094708
## day       0.5679005  0.4083800 -0.15251872 -0.2549888
## duration  0.2483390 -0.3964484 -0.79159116  0.3931152
## campaign  0.5266737  0.4820095  0.05924517  0.2736316

SVD for PCA

Usually for SVD we know that: X is a matrix of n rows and p colums (nxp matrix) where Xnxp = Unxn * Dnxp * Vpxp Above we used general princomp() function, but svd itself gives us more information and control

df <- nrow(bank1) - 1
pca.svd <- scale(bank1)
pca.svd1 <- svd(pca.svd)
pca.svd1$d/sqrt(df)
## [1] 1.1188712 1.0731079 0.9870105 0.9219533 0.8788511
pca.svd1$v
##            [,1]       [,2]        [,3]       [,4]          [,5]
## [1,]  0.4375835 -0.5145330  0.06308829 -0.6678371 -0.3062513800
## [2,]  0.3833436 -0.4230569  0.58534721  0.5094708  0.2681021012
## [3,] -0.5679005 -0.4083800  0.15251872  0.2549888 -0.6499489335
## [4,] -0.2483390  0.3964484  0.79159116 -0.3931152 -0.0005804307
## [5,] -0.5266737 -0.4820095 -0.05924517 -0.2736316  0.6417923360

Multidimensional Scaling (MDS) on Bank marketing data

Multidimensional scaling is mostly similar to PCA except that instead of converting correlations it simply takes into account the distances. Simply putting it converts distances among samples, simply putting each single object in N-dimensional space s.t. the distances between objects are maintained.

Using traditional cmdscale()

dist.reg <- dist(bank1)
as.matrix(dist.reg)[1:10, 1:10]
##            1        2         3         4         5         6         7
## 1     0.0000 3005.322  449.7099  333.7454 1793.3251 1041.8666 1503.0319
## 2  3005.3216    0.000 3439.1823 3313.0789 4789.0781 4042.7904 4483.6350
## 3   449.7099 3439.182    0.0000  127.5735 1350.8805  604.6445 1054.6042
## 4   333.7454 3313.079  127.5735    0.0000 1476.5361  731.5969 1177.6634
## 5  1793.3251 4789.078 1350.8805 1476.5361    0.0000  752.4194  328.7613
## 6  1041.8666 4042.790  604.6445  731.5969  752.4194    0.0000  483.4077
## 7  1503.0319 4483.635 1054.6042 1177.6634  328.7613  483.4077    0.0000
## 8  1641.6562 4642.519 1203.5290 1329.9019  166.2408  600.3374  248.5438
## 9  1566.2015 4570.915 1136.2509 1263.1053  278.9409  532.7748  296.7794
## 10 1889.5910 4877.901 1443.7081 1568.2685  125.3515  852.5902  396.0644
##            8         9        10
## 1  1641.6562 1566.2015 1889.5910
## 2  4642.5195 4570.9153 4877.9006
## 3  1203.5290 1136.2509 1443.7081
## 4  1329.9019 1263.1053 1568.2685
## 5   166.2408  278.9409  125.3515
## 6   600.3374  532.7748  852.5902
## 7   248.5438  296.7794  396.0644
## 8     0.0000  119.9166  285.6694
## 9   119.9166    0.0000  401.2867
## 10  285.6694  401.2867    0.0000
bank.mds <- cmdscale(dist.reg, 2) 
summary(bank.mds)
##        V1                 V2         
##  Min.   :-15408.1   Min.   :-271.44  
##  1st Qu.:  -126.1   1st Qu.:-164.34  
##  Median :   985.8   Median : -78.45  
##  Mean   :     0.0   Mean   :   0.00  
##  3rd Qu.:  1362.7   3rd Qu.:  65.16  
##  Max.   :  2671.3   Max.   :1602.19
plot(bank.mds)

plot(bank.mds, ylim=c(-1e+05, 1e+05)) 

Torgerson Gower scaling

To make MDS comparable to PCA here we use rescaled Euclidean distance where classical Torgerson Gower is double centered. Here we also check the goodness of fit which is not that substantial 48% explained by the first two dimensions.

varbank <- apply(bank1,2,var)
bank.adapted <- sweep(bank1,2,sqrt(varbank),"/")
bank.gower <- cmdscale(dist(bank.adapted),k=2,eig=T)

bank.gower$GOF
## [1] 0.4806867 0.4806867

Next, we have a 2-dimensinal solution by minimizing the loss function

eqscplot(bank.gower$points,type="n")
text(bank.gower$point,row.names(bank1),cex=.8)

Additionally we can measure the dissimilarity using the dimension reduction. Below to measure the similarity we use the correlation coefficient.

bank.vs <- cmdscale(1/cor(bank1),k=2,eig=T)
eqscplot(bank.vs$points,type="n")
text(bank.vs$point,row.names(cor(bank1)),cex=.8)

Advanced functions applied to MDS

bank.dis <- dist(bank1)
bank.fit <- mds(bank.dis, ndim=2)
bank.fit
## 
## Call:
## mds(delta = bank.dis, ndim = 2)
## 
## Model: Symmetric SMACOF 
## Number of objects: 555 
## Stress-1 value: 0 
## Number of iterations: 1
#summary(bank.fit)
plot(bank.fit)

Procrustes analysis

It is a statistical method which weighs up a group of shapes and transforms them into a state of maximal placement of images.

x1<-bank1[,c(1:2)]
x2<-bank1[,c(3:4)]
dis.x1<-dist(x1)
dis.x2<-dist(x2)
pr.x1<-mds(dis.x1, ndim=2)
pr.x2<-mds(dis.x2, ndim=2)
plot(pr.x1)

plot(pr.x2)

Getting the MDS coordinates and plotting Procrustes Configuration Plot

plot(pr.x1$conf[,1:2]) 

pr1 <- Procrustes(pr.x1$conf, pr.x2$conf) 
pr1
## 
## Call: Procrustes(X = pr.x1$conf, Y = pr.x2$conf)
## 
## Congruence coefficient: 0.372
## 
## Rotation matrix:
##       D1     D2
## D1 1.000 -0.003
## D2 0.003  1.000
## 
## Translation vector: 0 0 
## Dilation factor: 0.031
plot(pr1)

MDS using SVD

Below we show that the MDS can also be achieved using svd() function as it was done for the PCA

bank.mds1 <- bank[1:555, c(1,6,10,12:15)]
bank.mds11 <- as.matrix(sweep(bank.mds1, 2, colMeans(bank.mds1)))
ss <- svd(bank.mds11 %*% t(bank.mds11))
ss1 <- ss$v %*% sqrt(diag(ss$d))
a1 <- ss1[, 1]
b1 <- ss1[, 2]
plot(a1, b1)