Principal Component Analysis and Clustering

Statistical Methods in Data Mining (Metodos Estatisticos em Data Mining)

Luigi Ruberto

Overview

Data set description
Principal component analysis
- Normalization?
- Useless variables?
Clustering
- Hierarchical
- K-means
Clustering on PCA
Conclusion

Data set

CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)² where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: median value of owner-occupied homes in $1000s

Summary

Summary of the first four variables of the data set

summary(house[, 1:4])

##       CRIM             ZN            INDUS            CHAS       
##  Min.   : 0.01   Min.   :  0.0   Min.   : 0.46   Min.   :0.0000  
##  1st Qu.: 0.08   1st Qu.:  0.0   1st Qu.: 5.19   1st Qu.:0.0000  
##  Median : 0.26   Median :  0.0   Median : 9.69   Median :0.0000  
##  Mean   : 3.61   Mean   : 11.4   Mean   :11.14   Mean   :0.0692  
##  3rd Qu.: 3.68   3rd Qu.: 12.5   3rd Qu.:18.10   3rd Qu.:0.0000  
##  Max.   :88.98   Max.   :100.0   Max.   :27.74   Max.   :1.0000

All variables are quantitative except for CHAS

-> meaningless variable (binary dummy)

-> discard

Ranges

##               Min     Max   Range
## CRIM      0.00632  88.976  88.970
## ZN        0.00000 100.000 100.000
## INDUS     0.46000  27.740  27.280
## CHAS      0.00000   1.000   1.000
## NOX       0.38500   0.871   0.486
## RM        3.56100   8.780   5.219
## AGE       2.90000 100.000  97.100
## DIS       1.12960  12.127  10.997
## RAD       1.00000  24.000  23.000
## TAX     187.00000 711.000 524.000
## PTRATIO  12.60000  22.000   9.400
## B         0.32000 396.900 396.580
## LSTAT     1.73000  37.970  36.240
## MEDV      5.00000  50.000  45.000

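Ranges differ by up to three orders of magnitude (NOX: 0.486 vs. TAX: 524), so without normalization the large-range variables dominate the PCA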
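The range table above can be reproduced with a few lines of base R. This is a minimal sketch, assuming that `MASS::Boston` is the same data as the deck's `house` data frame (the deck does not show how `house` was loaded); MASS uses lowercase column names and calls the B variable `black`.

```r
# Assumption: MASS::Boston stands in for the deck's `house` data frame.
library(MASS)
house <- Boston
names(house) <- toupper(names(house))  # CRIM, ZN, ...; B appears as BLACK

# One row per variable: minimum, maximum, and their difference.
ranges <- t(sapply(house, function(x) {
  c(Min = min(x), Max = max(x), Range = max(x) - min(x))
}))
print(ranges)
```

Sorting `ranges[, "Range"]` makes the spread between the smallest and largest scales easy to see.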

Without normalization

pca1 <- princomp(house, cor = FALSE)

[Figure: plot of pca1 (PCA without normalization)]

The first 2 components explain 96% of the variance

With normalization

pca2 <- princomp(house, cor = TRUE)

[Figure: plot of pca2 (PCA with normalization)]

The first 2 components explain 75% of the variance
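The share of variance carried by the leading components can be read from the `sdev` element of a `princomp` fit. The sketch below again assumes `MASS::Boston` as a stand-in for `house`, so the printed numbers may not match the slides if the data differ; `cum_var` is a helper name introduced here for illustration.

```r
# Assumption: MASS::Boston stands in for the deck's `house` data frame.
library(MASS)
house <- Boston

pca_raw <- princomp(house, cor = FALSE)  # covariance matrix: raw scales
pca_std <- princomp(house, cor = TRUE)   # correlation matrix: standardized

# Cumulative proportion of variance explained by the leading components.
cum_var <- function(p) cumsum(p$sdev^2) / sum(p$sdev^2)
print(round(cum_var(pca_raw)[1:2], 3))
print(round(cum_var(pca_std)[1:2], 3))

# With cor = TRUE the component variances are the eigenvalues of cor(house).
print(all.equal(unname(pca_std$sdev^2), eigen(cor(house))$values))
```

`summary(pca_std)` prints the same cumulative proportions in its last row.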

Plots between each component

[Figure: pairwise plots of the principal components]

Discarding the CHAS variable

pca3 <- princomp(house[, -4], cor = TRUE)

[Figure: plot of pca3 (PCA with normalization, CHAS discarded)]

The first 2 components explain 79% of the variance

Meaning of principal components

##          [,1]     [,2]     [,3]     [,4]     [,5]
##  [1,]  0.2422 -0.01172  0.40870 -0.06251  0.21283
##  [2,] -0.2455 -0.11184  0.43428 -0.30143  0.36118
##  [3,]  0.3319  0.11604 -0.08762  0.01862  0.09398
##  [4,]  0.3253  0.25894 -0.09797 -0.19339  0.13978
##  [5,] -0.2027  0.53306  0.24775  0.18533 -0.16766
##  [6,]  0.2971  0.25040 -0.25848 -0.07534  0.03343
##  [7,] -0.2983 -0.36832  0.23986 -0.02344  0.02078
##  [8,]  0.3034  0.08933  0.41446  0.21313  0.15493
##  [9,]  0.3240  0.06021  0.34094  0.14424  0.20437
## [10,]  0.2076 -0.32926  0.06369  0.70446 -0.25150
## [11,] -0.1966 -0.03080 -0.36296  0.40086  0.79103
## [12,]  0.3114 -0.24580 -0.11255 -0.28850  0.09600
## [13,] -0.2665  0.49290  0.06994  0.14318  0.04756

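First component: weighted average of all variables, with similar absolute loadings (about 0.20 to 0.33) and mixed signs
Second component: value of the house (largest loadings on rows 5 and 13, which are RM and MEDV in the column order of house[, -4])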
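The loading matrix above is printed without variable names, which makes the interpretation harder to check. A possible way to get a labelled version is sketched below, assuming `MASS::Boston` as a stand-in for `house`; signs of whole columns may be flipped relative to the slide, since the sign of a principal component is arbitrary.

```r
# Assumption: MASS::Boston stands in for the deck's `house` data frame.
library(MASS)
house <- Boston
names(house) <- toupper(names(house))      # B appears as BLACK in MASS
pca3 <- princomp(house[, -4], cor = TRUE)  # drop CHAS (column 4)

# unclass() turns the "loadings" object into a plain matrix with row names.
L <- unclass(loadings(pca3))[, 1:5]
print(round(L, 3))

# Variables with the largest absolute loading on the second component.
print(sort(abs(L[, 2]), decreasing = TRUE)[1:4])
```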

Hierarchical Clustering

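library(dynamicTreeCut)  # provides cutreeHybrid()
d <- dist(house[, -4])   # Euclidean distances, CHAS (column 4) dropped
hc <- hclust(d)          # complete linkage by default
cut <- cutreeHybrid(hc, distM = as.matrix(d))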

[Figure: hierarchical clustering result]
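For readers without the dynamicTreeCut package, a fixed-size cut with base R's `cutree` gives a rough equivalent. This is a different technique from the `cutreeHybrid` call above, not a reproduction of it. The sketch assumes `MASS::Boston` as a stand-in for `house`, standardizes the variables first (the slide clusters the raw values), and borrows k = 9 from the k-means slides.

```r
# Assumption: MASS::Boston stands in for the deck's `house` data frame.
library(MASS)
house <- Boston

x  <- scale(house[, -4])      # drop CHAS (column 4), then standardize
hc <- hclust(dist(x))         # defaults: Euclidean distance, complete linkage
groups <- cutree(hc, k = 9)   # k = 9 borrowed from the k-means slides
print(table(groups))          # cluster sizes
```

`plot(hc, labels = FALSE)` draws the dendrogram that the cut is applied to.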

K-means Clustering

kc <- kmeans(house[, -4], 9)

[Figure: k-means clustering result]

Better cluster division than with hierarchical clustering
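Because `kmeans` starts from random centres, the result changes from run to run. A reproducible variant might look like the sketch below; the seed, `nstart`, and `iter.max` values are illustrative choices that are not in the original slides, and `MASS::Boston` is again assumed as a stand-in for `house`.

```r
# Assumption: MASS::Boston stands in for the deck's `house` data frame.
library(MASS)
house <- Boston

set.seed(1)  # hypothetical seed, only for reproducibility
kc <- kmeans(house[, -4], centers = 9, nstart = 25, iter.max = 50)

print(kc$size)                 # number of observations per cluster
print(kc$betweenss / kc$totss) # share of total variance between clusters
```

The ratio `betweenss / totss` gives a simple number to compare partitions that use the same data and the same k.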

Clustering on PCA

pcakc <- kmeans(pca3$scores[, 1:4], 9)

[Figure: k-means clustering on the first 4 principal component scores]
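One way to look at the clusters found on the PCA scores is to colour the observations in the plane of the first two components. The sketch assumes `MASS::Boston` as a stand-in for `house`; the seed and `nstart` are illustrative and not part of the original slides.

```r
# Assumption: MASS::Boston stands in for the deck's `house` data frame.
library(MASS)
house <- Boston
pca3 <- princomp(house[, -4], cor = TRUE)  # drop CHAS (column 4)

set.seed(1)  # hypothetical seed, only for reproducibility
pcakc <- kmeans(pca3$scores[, 1:4], centers = 9, nstart = 25)

# Scores on the first two components, coloured by cluster (colours recycle).
plot(pca3$scores[, 1:2], col = pcakc$cluster, pch = 19,
     xlab = "Comp.1", ylab = "Comp.2")
```

Clustering on a few standardized components removes the scale differences between the original variables, at the cost of dropping the variance carried by the remaining components.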