- Data set description
- Principal component analysis
  - Normalization?
  - Useless variables?
- Clustering
  - Hierarchical
  - K-means
  - Clustering on PCA
- Conclusion
Luigi Ruberto
Summary of the first four variables of the data set
summary(house[, 1:4])
##       CRIM             ZN            INDUS            CHAS
##  Min.   : 0.01   Min.   :  0.0   Min.   : 0.46   Min.   :0.0000
##  1st Qu.: 0.08   1st Qu.:  0.0   1st Qu.: 5.19   1st Qu.:0.0000
##  Median : 0.26   Median :  0.0   Median : 9.69   Median :0.0000
##  Mean   : 3.61   Mean   : 11.4   Mean   :11.14   Mean   :0.0692
##  3rd Qu.: 3.68   3rd Qu.: 12.5   3rd Qu.:18.10   3rd Qu.:0.0000
##  Max.   :88.98   Max.   :100.0   Max.   :27.74   Max.   :1.0000
All variables are quantitative except CHAS, a binary (0/1) indicator
-> meaningless variable for PCA and distance-based clustering
-> discard
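The check behind this decision can be sketched as follows. This is a minimal illustration, not the author's code: it assumes that `MASS::Boston` (which has the same 14 variables, in lower case) can stand in for the `house` data frame, and the names `n_distinct`, `binary_vars` and `house_num` are hypothetical.

```r
# Assumption: MASS::Boston stands in for the `house` data frame of the slides.
house <- MASS::Boston
names(house) <- toupper(names(house))

# Count the distinct values in each column; a column with exactly two
# distinct values behaves like a 0/1 indicator rather than a measurement.
n_distinct <- sapply(house, function(x) length(unique(x)))
binary_vars <- names(n_distinct)[n_distinct == 2]
print(binary_vars)

# Drop the indicator columns before PCA and clustering.
house_num <- house[, !(names(house) %in% binary_vars), drop = FALSE]
print(dim(house_num))
```

Selecting columns by name rather than by position (`house[, -4]`) keeps the step valid if the column order ever changes.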
## Min Max Range
## CRIM 0.00632 88.976 88.970
## ZN 0.00000 100.000 100.000
## INDUS 0.46000 27.740 27.280
## CHAS 0.00000 1.000 1.000
## NOX 0.38500 0.871 0.486
## RM 3.56100 8.780 5.219
## AGE 2.90000 100.000 97.100
## DIS 1.12960 12.127 10.997
## RAD 1.00000 24.000 23.000
## TAX 187.00000 711.000 524.000
## PTRATIO 12.60000 22.000 9.400
## B 0.32000 396.900 396.580
## LSTAT 1.73000 37.970 36.240
## MEDV 5.00000 50.000 45.000
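A table like the one above could be produced along these lines. The sketch assumes `MASS::Boston` as a stand-in for `house` (its `black` column is renamed to `B` to match the slides); the object name `ranges` is hypothetical.

```r
# Assumption: MASS::Boston stands in for the `house` data frame of the slides.
house <- MASS::Boston
names(house) <- toupper(names(house))
names(house)[names(house) == "BLACK"] <- "B"

# One row per variable: minimum, maximum and their difference.
ranges <- t(sapply(house, function(x) {
  c(Min = min(x), Max = max(x), Range = max(x) - min(x))
}))
print(ranges)
```

The ranges differ by several orders of magnitude (NOX versus TAX, for example), which is what motivates the normalization question in the PCA step.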
pca1 <- princomp(house, cor = FALSE)
pca2 <- princomp(house, cor = TRUE)
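To see what the `cor` argument changes, one can compare the share of variance carried by the leading components in each fit. The following is a sketch under the assumption that `MASS::Boston` stands in for `house`; `prop_var` and `pv` are hypothetical helper names.

```r
# Assumption: MASS::Boston stands in for the `house` data frame of the slides.
house <- MASS::Boston

pca_cov <- princomp(house, cor = FALSE)  # PCA on the covariance matrix (raw scales)
pca_cor <- princomp(house, cor = TRUE)   # PCA on the correlation matrix (standardized)

# Proportion of total variance explained by each component.
prop_var <- function(p) p$sdev^2 / sum(p$sdev^2)

pv <- cbind(covariance = prop_var(pca_cov), correlation = prop_var(pca_cor))
print(round(pv[1:4, ], 3))
```

Without normalization, the variables with the largest ranges dominate the first component, so the covariance-based fit concentrates far more variance there than the correlation-based fit does.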
Pairwise plots of the principal components
pca3 <- princomp(house[, -4], cor = TRUE)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.2422 -0.01172 0.40870 -0.06251 0.21283
## [2,] -0.2455 -0.11184 0.43428 -0.30143 0.36118
## [3,] 0.3319 0.11604 -0.08762 0.01862 0.09398
## [4,] 0.3253 0.25894 -0.09797 -0.19339 0.13978
## [5,] -0.2027 0.53306 0.24775 0.18533 -0.16766
## [6,] 0.2971 0.25040 -0.25848 -0.07534 0.03343
## [7,] -0.2983 -0.36832 0.23986 -0.02344 0.02078
## [8,] 0.3034 0.08933 0.41446 0.21313 0.15493
## [9,] 0.3240 0.06021 0.34094 0.14424 0.20437
## [10,] 0.2076 -0.32926 0.06369 0.70446 -0.25150
## [11,] -0.1966 -0.03080 -0.36296 0.40086 0.79103
## [12,] 0.3114 -0.24580 -0.11255 -0.28850 0.09600
## [13,] -0.2665 0.49290 0.06994 0.14318 0.04756
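The loadings are easier to read with variable names attached to the rows. A minimal sketch, assuming `MASS::Boston` stands in for `house` (so signs and values may differ from the matrix above); `load5` is a hypothetical name.

```r
# Assumption: MASS::Boston stands in for the `house` data frame of the slides.
house <- MASS::Boston
names(house) <- toupper(names(house))

pca3 <- princomp(house[, -4], cor = TRUE)  # CHAS (column 4) removed

# Loadings of the first five components as a plain matrix with row names.
load5 <- unclass(pca3$loadings)[, 1:5]
print(round(load5, 4))

# Each column is an eigenvector, so its squared entries sum to 1.
print(colSums(load5^2))

# Variables with the largest absolute weight on the first component.
print(head(sort(abs(load5[, 1]), decreasing = TRUE), 3))
```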
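library(dynamicTreeCut)  # provides cutreeHybrid()
d <- dist(house[, -4])   # Euclidean distances, CHAS removed
hc <- hclust(d)          # complete linkage (the hclust default)
cut <- cutreeHybrid(hc, distM = as.matrix(d))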
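`cutreeHybrid()` comes from the dynamicTreeCut package and chooses the clusters itself. For comparison, here is a sketch that uses only base R: it swaps in `cutree()` with a fixed number of groups, which is a different technique from the hybrid cut. It assumes `MASS::Boston` stands in for `house`, and `k = 9` is borrowed from the k-means calls in these slides.

```r
# Assumption: MASS::Boston stands in for the `house` data frame of the slides.
house <- MASS::Boston

d <- dist(house[, -4])  # Euclidean distances on the raw variables, CHAS removed
hc <- hclust(d)         # complete linkage (the default method)

# Fixed-height cut that yields exactly k groups (not the dynamic hybrid cut).
groups <- cutree(hc, k = 9)
print(table(groups))
```

Because the distances are computed on unscaled variables, the grouping is driven mostly by the variables with the largest ranges; `dist(scale(house[, -4]))` would be the standardized counterpart.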
kc <- kmeans(house[, -4], 9)
pcakc <- kmeans(pca3$scores[, 1:4], 9)
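The two k-means partitions can be compared with a cross-tabulation. The sketch below assumes `MASS::Boston` stands in for `house`; the seed, `nstart` and `iter.max` values are illustrative choices, not taken from the slides.

```r
# Assumption: MASS::Boston stands in for the `house` data frame of the slides.
house <- MASS::Boston
pca3 <- princomp(house[, -4], cor = TRUE)

# k-means starts from random centers, so fix the seed and use several
# restarts to make the result reproducible and less start-dependent.
set.seed(1)
kc <- kmeans(house[, -4], centers = 9, nstart = 20, iter.max = 50)
pcakc <- kmeans(pca3$scores[, 1:4], centers = 9, nstart = 20, iter.max = 50)

# Rows: clusters on the raw data; columns: clusters on the first four PCs.
agreement <- table(raw = kc$cluster, pca = pcakc$cluster)
print(agreement)
```

Cluster labels are arbitrary, so agreement shows up as rows whose counts concentrate in a single column rather than as a filled diagonal.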