Principal Component Analysis and Clustering

Statistical Methods in Data Mining (Métodos Estatísticos em Data Mining)

Luigi Ruberto

Overview

  • Data set description
  • Principal component analysis
    • Normalization?
    • Useless variables?
  • Clustering
    • Hierarchical
    • K-means
  • Clustering on PCA
  • Conclusion

Data set

  1. CRIM: per capita crime rate by town
  2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS: proportion of non-retail business acres per town
  4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. NOX: nitric oxides concentration (parts per 10 million)
  6. RM: average number of rooms per dwelling
  7. AGE: proportion of owner-occupied units built prior to 1940
  8. DIS: weighted distances to five Boston employment centres
  9. RAD: index of accessibility to radial highways
  10. TAX: full-value property-tax rate per $10,000
  11. PTRATIO: pupil-teacher ratio by town
  12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT: % lower status of the population
  14. MEDV: median value of owner-occupied homes in $1000s
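The slides do not show how `house` is created. A minimal sketch, assuming the Boston data shipped with the MASS package is an acceptable stand-in for it (there the columns are lowercase and the B variable is named `black`, so they are renamed here to match the list above):

```r
# Assumption: MASS::Boston stands in for the author's `house` data frame.
library(MASS)

house <- Boston
names(house) <- toupper(names(house))          # crim -> CRIM, zn -> ZN, ...
names(house)[names(house) == "BLACK"] <- "B"   # MASS calls this column `black`

dim(house)     # rows (census tracts) and columns (variables)
names(house)
```

The column order matters later, because the slides drop CHAS by position with `house[, -4]`.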

Summary

Summary of the first four variables

summary(house[, 1:4])
##       CRIM             ZN            INDUS            CHAS       
##  Min.   : 0.01   Min.   :  0.0   Min.   : 0.46   Min.   :0.0000  
##  1st Qu.: 0.08   1st Qu.:  0.0   1st Qu.: 5.19   1st Qu.:0.0000  
##  Median : 0.26   Median :  0.0   Median : 9.69   Median :0.0000  
##  Mean   : 3.61   Mean   : 11.4   Mean   :11.14   Mean   :0.0692  
##  3rd Qu.: 3.68   3rd Qu.: 12.5   3rd Qu.:18.10   3rd Qu.:0.0000  
##  Max.   :88.98   Max.   :100.0   Max.   :27.74   Max.   :1.0000
  • All variables are quantitative except CHAS, a binary (0/1) dummy

    -> not a meaningful variable for PCA

    -> discard

Ranges

##               Min     Max   Range
## CRIM      0.00632  88.976  88.970
## ZN        0.00000 100.000 100.000
## INDUS     0.46000  27.740  27.280
## CHAS      0.00000   1.000   1.000
## NOX       0.38500   0.871   0.486
## RM        3.56100   8.780   5.219
## AGE       2.90000 100.000  97.100
## DIS       1.12960  12.127  10.997
## RAD       1.00000  24.000  23.000
## TAX     187.00000 711.000 524.000
## PTRATIO  12.60000  22.000   9.400
## B         0.32000 396.900 396.580
## LSTAT     1.73000  37.970  36.240
## MEDV      5.00000  50.000  45.000
  • Ranges are very different, so without normalization the widest-range variables (TAX, B) would dominate the variance
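The code behind the range table is not shown. One way it could be produced, sketched here with MASS::Boston as an assumed stand-in for `house` (so the row names come out lowercase, and B appears as `black`):

```r
# Assumption: MASS::Boston stands in for the author's `house` data frame.
library(MASS)
house <- Boston

# One row per variable, with its minimum, maximum and range
ranges <- t(sapply(house, function(x) {
  c(Min = min(x), Max = max(x), Range = max(x) - min(x))
}))
ranges

# The standard deviations differ by orders of magnitude as well
round(sapply(house, sd), 3)
```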

Without normalization

pca1 <- princomp(house, cor = FALSE)

[Figure: PCA without normalization (pca1)]

  • The first 2 components account for 96% of the variance
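The share of variance per component can be read from `summary(pca1)` or computed from the component standard deviations. A sketch, again assuming MASS::Boston as a stand-in for `house`; the names `var_explained` and `cum_explained` are illustrative:

```r
# Assumption: MASS::Boston stands in for the author's `house` data frame.
library(MASS)
house <- Boston

pca1 <- princomp(house, cor = FALSE)

# Variance of each component as a fraction of the total variance
var_explained <- pca1$sdev^2 / sum(pca1$sdev^2)
cum_explained <- cumsum(var_explained)
round(cum_explained[1:2], 3)

# Variables with the largest absolute weight on the first component
load1 <- unclass(pca1$loadings)[, 1]
round(sort(abs(load1), decreasing = TRUE)[1:3], 3)
```

On unscaled data the first component tends to follow whichever variable has the largest variance, which is why a high percentage here says little about the structure of the data.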

With normalization

pca2 <- princomp(house, cor = TRUE)

[Figure: PCA with normalization (pca2)]

  • The first 2 components account for 75% of the variance
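Setting `cor = TRUE` makes `princomp` work on the correlation matrix, which amounts to standardizing every variable first. A small check of that equivalence, assuming MASS::Boston as a stand-in for `house`; `prop` is a helper introduced only for this sketch:

```r
# Assumption: MASS::Boston stands in for the author's `house` data frame.
library(MASS)
house <- Boston

pca2       <- princomp(house, cor = TRUE)
pca_scaled <- princomp(scale(house), cor = FALSE)  # standardize by hand

# Hypothetical helper: proportion of variance per component
prop <- function(p) unname(p$sdev^2 / sum(p$sdev^2))

# Both fits give the same proportions of variance
all.equal(prop(pca2), prop(pca_scaled))

# With cor = TRUE the component variances sum to the number of variables
sum(pca2$sdev^2)
```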

Plots between each component

[Figure: plots between each pair of components]
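The original figure is not recoverable from the text. One plausible way to draw plots between components is a scatterplot matrix of the scores. This sketch assumes MASS::Boston as a stand-in for `house` and picks the first 4 components arbitrarily:

```r
# Assumption: MASS::Boston stands in for the author's `house` data frame.
library(MASS)
house <- Boston

pca2   <- princomp(house, cor = TRUE)
scores <- pca2$scores[, 1:4]   # coordinates of each tract on components 1 to 4

# One scatterplot per pair of components
pairs(scores, pch = 20, cex = 0.5,
      main = "Scores on the first 4 components")
```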

Discarding variable

pca3 <- princomp(house[, -4], cor = TRUE)

[Figure: PCA after discarding CHAS (pca3)]

  • The first 2 components account for 79% of the variance

Meaning of principal components

##          [,1]     [,2]     [,3]     [,4]     [,5]
##  [1,]  0.2422 -0.01172  0.40870 -0.06251  0.21283
##  [2,] -0.2455 -0.11184  0.43428 -0.30143  0.36118
##  [3,]  0.3319  0.11604 -0.08762  0.01862  0.09398
##  [4,]  0.3253  0.25894 -0.09797 -0.19339  0.13978
##  [5,] -0.2027  0.53306  0.24775  0.18533 -0.16766
##  [6,]  0.2971  0.25040 -0.25848 -0.07534  0.03343
##  [7,] -0.2983 -0.36832  0.23986 -0.02344  0.02078
##  [8,]  0.3034  0.08933  0.41446  0.21313  0.15493
##  [9,]  0.3240  0.06021  0.34094  0.14424  0.20437
## [10,]  0.2076 -0.32926  0.06369  0.70446 -0.25150
## [11,] -0.1966 -0.03080 -0.36296  0.40086  0.79103
## [12,]  0.3114 -0.24580 -0.11255 -0.28850  0.09600
## [13,] -0.2665  0.49290  0.06994  0.14318  0.04756
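  • Rows follow the column order of house[, -4] (CRIM, ZN, INDUS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT, MEDV); columns are components 1 to 5
  • First component: a signed average of the variables, all weighted with similar magnitude (about 0.2 to 0.33)
  • Second component: largest weights on RM (row 5) and MEDV (row 13), i.e. the value of the house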
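The matrix above is printed without variable names, which makes the components hard to interpret. A sketch of how the loadings could be printed with labels, assuming MASS::Boston as a stand-in for `house` (lowercase names); the signs of whole columns may come out flipped, since the sign of a component is arbitrary:

```r
# Assumption: MASS::Boston stands in for the author's `house` data frame.
library(MASS)
house <- Boston

pca3 <- princomp(house[, -4], cor = TRUE)   # column 4 is chas

# Loadings of the first 5 components, one labelled row per variable
L <- unclass(pca3$loadings)[, 1:5]
round(L, 4)

# Variables with the largest absolute weight on the second component
top3 <- names(sort(abs(L[, 2]), decreasing = TRUE))[1:3]
top3
```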

Hierarchical Clustering

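library(dynamicTreeCut)                        # provides cutreeHybrid
d <- dist(house[, -4])                         # Euclidean distances, CHAS dropped
hc <- hclust(d)
cut <- cutreeHybrid(hc, distM = as.matrix(d))  # dynamic cut of the dendrogram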

[Figure: hierarchical clustering result]
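For readers without the dynamicTreeCut package, a fixed-size cut with base R's `cutree` is a simpler alternative. It is not the dynamic cut used in the slides. The sketch assumes MASS::Boston as a stand-in for `house` and borrows k = 9 from the k-means slide:

```r
# Assumption: MASS::Boston stands in for the author's `house` data frame.
# Swapped-in technique: cutree() with a fixed k, not cutreeHybrid().
library(MASS)
house <- Boston

d  <- dist(house[, -4])   # Euclidean distances on the raw (unscaled) variables
hc <- hclust(d)           # complete linkage is the default method

groups <- cutree(hc, k = 9)   # cut the dendrogram into 9 clusters
table(groups)                 # cluster sizes
```

Because the distances are computed on unscaled data, the variables with the widest ranges weigh most heavily in the result.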

K-means Clustering

kc <- kmeans(house[, -4], centers = 9)  # k-means with 9 clusters, CHAS dropped

[Figure: k-means clustering result]

  • K-means gives a clearer division into clusters than the hierarchical clustering
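`kmeans` starts from random centers, so the clusters can change from run to run. A sketch of a more reproducible call, assuming MASS::Boston as a stand-in for `house`; the seed, `nstart` and `iter.max` values are arbitrary choices, not taken from the slides:

```r
# Assumption: MASS::Boston stands in for the author's `house` data frame.
library(MASS)
house <- Boston

set.seed(42)   # arbitrary seed, only to make the run repeatable
kc <- kmeans(house[, -4], centers = 9, nstart = 25, iter.max = 50)

kc$size                    # number of tracts in each cluster
kc$betweenss / kc$totss    # share of total sum of squares between clusters
```

The ratio `betweenss / totss` gives a rough numeric counterpart to judging the cluster division from a plot.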

Clustering on PCA

pcakc <- kmeans(pca3$scores[, 1:4], centers = 9)  # k-means on the first 4 component scores

[Figure: k-means clustering on the PCA scores]
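To see how far the clustering on PCA scores differs from the clustering on the raw variables, the two assignments can be cross-tabulated. A sketch assuming MASS::Boston as a stand-in for `house`, with an arbitrary seed and restart count:

```r
# Assumption: MASS::Boston stands in for the author's `house` data frame.
library(MASS)
house <- Boston

pca3 <- princomp(house[, -4], cor = TRUE)

set.seed(42)   # arbitrary seed, only to make the run repeatable
pcakc <- kmeans(pca3$scores[, 1:4], centers = 9, nstart = 25, iter.max = 50)
kc    <- kmeans(house[, -4],        centers = 9, nstart = 25, iter.max = 50)

# Rows: clusters found on the PCA scores; columns: clusters found on raw data
tab <- table(pca = pcakc$cluster, raw = kc$cluster)
tab
```

Cluster labels are arbitrary, so agreement shows up as one large count per row rather than as a filled diagonal.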