NASA’s Kepler Space Telescope was a space observatory dedicated to finding planetary systems that may resemble ours. It collected data for nine years, from 2009 to 2018. Since its launch, the telescope discovered thousands of stars and extrasolar planets (exoplanets).
The analyzed dataset was published as-is by NASA. It is a cumulative record of all objects observed by Kepler (up to 2016), either confirmed exoplanets or candidates. To find out more about the other variables and the topic itself, please visit the Caltech site or see the original table here.
## 'data.frame': 9564 obs. of 50 variables:
## $ rowid : int 1 2 3 4 5 6 7 8 9 10 ...
## $ kepid : int 10797460 10797460 10811496 10848459 10854555 10872983 10872983 10872983 6721123 10910878 ...
## $ kepoi_name : Factor w/ 9564 levels "K00001.01","K00002.01",..: 1081 1082 1083 1084 1085 1086 1087 1088 108 1089 ...
## $ kepler_name : Factor w/ 2295 levels "","Kepler-1 b",..: 1037 1038 1 1 1869 1041 1040 1039 1 1043 ...
## $ koi_disposition : Factor w/ 3 levels "CANDIDATE","CONFIRMED",..: 2 2 3 3 2 2 2 2 3 2 ...
## $ koi_pdisposition : Factor w/ 2 levels "CANDIDATE","FALSE POSITIVE": 1 1 2 2 1 1 1 1 2 1 ...
## $ koi_score : num 1 0.969 0 0 1 1 1 0.992 0 1 ...
## $ koi_fpflag_nt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ koi_fpflag_ss : int 0 0 1 1 0 0 0 0 1 0 ...
## $ koi_fpflag_co : int 0 0 0 0 0 0 0 0 1 0 ...
## $ koi_fpflag_ec : int 0 0 0 0 0 0 0 0 0 0 ...
## $ koi_period : num 9.49 54.42 19.9 1.74 2.53 ...
## $ koi_period_err1 : num 2.78e-05 2.48e-04 1.49e-05 2.63e-07 3.76e-06 ...
## $ koi_period_err2 : num -2.78e-05 -2.48e-04 -1.49e-05 -2.63e-07 -3.76e-06 ...
## $ koi_time0bk : num 171 163 176 170 172 ...
## $ koi_time0bk_err1 : num 0.00216 0.00352 0.000581 0.000115 0.00113 0.00141 0.0019 0.00461 0.00253 0.000517 ...
## $ koi_time0bk_err2 : num -0.00216 -0.00352 -0.000581 -0.000115 -0.00113 -0.00141 -0.0019 -0.00461 -0.00253 -0.000517 ...
## $ koi_impact : num 0.146 0.586 0.969 1.276 0.701 ...
## $ koi_impact_err1 : num 0.318 0.059 5.126 0.115 0.235 ...
## $ koi_impact_err2 : num -0.146 -0.443 -0.077 -0.092 -0.478 -0.428 -0.532 -0.523 -0.044 -0.052 ...
## $ koi_duration : num 2.96 4.51 1.78 2.41 1.65 ...
## $ koi_duration_err1: num 0.0819 0.116 0.0341 0.00537 0.042 0.061 0.0673 0.165 0.136 0.0241 ...
## $ koi_duration_err2: num -0.0819 -0.116 -0.0341 -0.00537 -0.042 -0.061 -0.0673 -0.165 -0.136 -0.0241 ...
## $ koi_depth : num 616 875 10829 8079 603 ...
## $ koi_depth_err1 : num 19.5 35.5 171 12.8 16.9 24.2 18.7 16.8 5.8 33.3 ...
## $ koi_depth_err2 : num -19.5 -35.5 -171 -12.8 -16.9 -24.2 -18.7 -16.8 -5.8 -33.3 ...
## $ koi_prad : num 2.26 2.83 14.6 33.46 2.75 ...
## $ koi_prad_err1 : num 0.26 0.32 3.92 8.5 0.88 1.27 0.9 0.52 6.45 0.22 ...
## $ koi_prad_err2 : num -0.15 -0.19 -1.31 -2.83 -0.35 -0.42 -0.3 -0.17 -9.67 -0.49 ...
## $ koi_teq : num 793 443 638 1395 1406 ...
## $ koi_teq_err1 : logi NA NA NA NA NA NA ...
## $ koi_teq_err2 : logi NA NA NA NA NA NA ...
## $ koi_insol : num 93.59 9.11 39.3 891.96 926.16 ...
## $ koi_insol_err1 : num 29.45 2.87 31.04 668.95 874.33 ...
## $ koi_insol_err2 : num -16.65 -1.62 -10.49 -230.35 -314.24 ...
## $ koi_model_snr : num 35.8 25.8 76.3 505.6 40.9 ...
## $ koi_tce_plnt_num : int 1 2 1 1 1 1 2 3 1 1 ...
## $ koi_tce_delivname: Factor w/ 4 levels "","q1_q16_tce",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ koi_steff : num 5455 5455 5853 5805 6031 ...
## $ koi_steff_err1 : num 81 81 158 157 169 189 189 189 111 75 ...
## $ koi_steff_err2 : num -81 -81 -176 -174 -211 -232 -232 -232 -124 -83 ...
## $ koi_slogg : num 4.47 4.47 4.54 4.56 4.44 ...
## $ koi_slogg_err1 : num 0.064 0.064 0.044 0.053 0.07 0.054 0.054 0.054 0.182 0.083 ...
## $ koi_slogg_err2 : num -0.096 -0.096 -0.176 -0.168 -0.21 -0.229 -0.229 -0.229 -0.098 -0.028 ...
## $ koi_srad : num 0.927 0.927 0.868 0.791 1.046 ...
## $ koi_srad_err1 : num 0.105 0.105 0.233 0.201 0.334 0.315 0.315 0.315 0.322 0.033 ...
## $ koi_srad_err2 : num -0.061 -0.061 -0.078 -0.067 -0.133 -0.105 -0.105 -0.105 -0.483 -0.072 ...
## $ ra : num 292 292 297 286 289 ...
## $ dec : num 48.1 48.1 48.1 48.3 48.2 ...
## $ koi_kepmag : num 15.3 15.3 15.4 15.6 15.5 ...
As the output above shows, the data are of mixed type; the analysis is therefore divided into two parts, one for the continuous variables and one for the mixed-type variables.
The original dataframe consists of 50 variables and almost 10,000 observed objects (KOIs, Kepler Objects of Interest). The dataset has an extensive data dictionary, which can be accessed here. The columns of note include the disposition variables (koi_disposition, koi_pdisposition, koi_score), the false-positive flags, the orbital and transit parameters (e.g. koi_period, koi_duration, koi_depth), the planetary properties (koi_prad, koi_teq, koi_insol) and the stellar properties (koi_steff, koi_slogg, koi_srad).
Besides these variables, the dataset contains measurement-error columns that are not needed for the cluster analysis. Those columns were removed together with the ID columns. Rows containing missing values were also dropped, since planetary features cannot be sensibly imputed with summary statistics. The reduced dataset contains 23 variables and almost 8,000 observations.
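A minimal sketch of how this reduction could be reproduced is shown below; the raw-data object name (kepler_raw) and the exact column selection are assumptions inferred from the structure above and from the variables used later in the analysis.

# drop identifier columns, measurement-error columns (suffix _err1/_err2)
# and the TCE delivery name, then remove rows with missing values
drop_cols <- c("rowid", "kepid", "kepoi_name", "kepler_name", "koi_tce_delivname",
               grep("_err[12]$", names(kepler_raw), value = TRUE))
kepler_mix <- na.omit(kepler_raw[, !(names(kepler_raw) %in% drop_cols)])   # 23 mixed-type variables

# continuous subset: only the physical measurement variables used for k-means and PAM
kepler <- kepler_mix[, c("koi_score", "koi_period", "koi_time0bk", "koi_impact",
                         "koi_duration", "koi_depth", "koi_prad", "koi_teq",
                         "koi_insol", "koi_model_snr", "koi_steff", "koi_slogg",
                         "koi_srad", "ra", "dec", "koi_kepmag")]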
Before further analysis, both the mixed and the continuous datasets were standardized.
## Warning in data.Normalization(kepler_mix, type = "n1", normalization =
## "column"): Data not numeric, normalization not applicable
## Warning in data.Normalization(kepler_mix, type = "n1", normalization =
## "column"): Data not numeric, normalization not applicable
To determine whether the dataset is clusterable, three tests were performed on both the continuous and the mixed data. The Hopkins statistic measures the probability that the data were generated by a uniform distribution; in the implementation used here, values close to 0 indicate a clustering tendency. The second measure was the Dip test, which compares the empirical distribution function with the closest unimodal distribution and rejects unimodality if the difference is sufficiently large. The last test was the Silverman test, which is based on kernel density estimation: it determines the smallest (critical) bandwidth for which a Gaussian kernel density estimate of the data is unimodal and uses it, via a bootstrap, to test unimodality.
hopkins(n.kepler, n=100)
## $H
## [1] 0.0217199
clusterabilitytest(n.kepler, "dip",
reduction = "none",
distance_metric = "euclidean",
distance_standardize = "none",
is_dist_matrix = FALSE,
completecase = FALSE,
d_simulatepvalue = FALSE)
## n = 127904 > max_n{n in table} = 72000 -- using that as asymptotic value.
## ----------------------
## Clusterability Test
## ----------------------
##
## Data set name: n.kepler
## Your data set has 7994 observation(s) and 16 variable(s).
## There were no missing values. Your data set is complete.
##
##
## -----------------------------------------
## Results: Dip Test of Unimodality
## -----------------------------------------
##
## Null Hypothesis: number of modes = 1
## Alternative Hypothesis: number of modes > 1
## p-value: 0
## Dip statistic: 0.0317236193826746
##
## ---------------------
## Test Options Used
## ---------------------
##
## Default values for the optional parameters were used. To learn more about customizing the behavior of the clusterabilitytest, please see the R documentation.
clusterabilitytest(n.kepler, "silverman",
reduction = "none",
distance_metric = "euclidean",
distance_standardize = "none",
is_dist_matrix = FALSE,
completecase = FALSE,
d_simulatepvalue = FALSE,
s_m = 999, s_adjust = TRUE)
## ----------------------
## Clusterability Test
## ----------------------
##
## Data set name: n.kepler
## Your data set has 7994 observation(s) and 16 variable(s).
## There were no missing values. Your data set is complete.
##
##
## ----------------------------------------------
## Results: Silverman's Critical Bandwidth Test
## ----------------------------------------------
##
## Null Hypothesis: number of modes <= 1
## Alternative Hypothesis: number of modes > 1
## p-value: 0
## Critical bandwidth: 11.060238
##
## ---------------------
## Test Options Used
## ---------------------
##
## p value based on 999 bootstrap replicates
## Adjusted p-values based on the adjustment in Hall and York (2001)
## p value rounded to: 6 digits
hopkins(n.kepler_mix, n=100)
## $H
## [1] 0.02603993
clusterabilitytest(n.kepler_mix, "dip",
reduction = "none",
distance_metric = "euclidean",
distance_standardize = "none",
is_dist_matrix = FALSE,
completecase = FALSE,
d_simulatepvalue = FALSE)
## n = 183862 > max_n{n in table} = 72000 -- using that as asymptotic value.
## ----------------------
## Clusterability Test
## ----------------------
##
## Data set name: n.kepler_mix
## Your data set has 7994 observation(s) and 23 variable(s).
## There were no missing values. Your data set is complete.
##
##
## -----------------------------------------
## Results: Dip Test of Unimodality
## -----------------------------------------
##
## Null Hypothesis: number of modes = 1
## Alternative Hypothesis: number of modes > 1
## p-value: 0
## Dip statistic: 0.0246940455077549
##
## ---------------------
## Test Options Used
## ---------------------
##
## Default values for the optional parameters were used. To learn more about customizing the behavior of the clusterabilitytest, please see the R documentation.
clusterabilitytest(n.kepler_mix, "silverman",
reduction = "none",
distance_metric = "euclidean",
distance_standardize = "none",
is_dist_matrix = FALSE,
completecase = FALSE,
d_simulatepvalue = FALSE,
s_m = 999, s_adjust = TRUE)
## ----------------------
## Clusterability Test
## ----------------------
##
## Data set name: n.kepler_mix
## Your data set has 7994 observation(s) and 23 variable(s).
## There were no missing values. Your data set is complete.
##
##
## ----------------------------------------------
## Results: Silverman's Critical Bandwidth Test
## ----------------------------------------------
##
## Null Hypothesis: number of modes <= 1
## Alternative Hypothesis: number of modes > 1
## p-value: 0
## Critical bandwidth: 11.030802
##
## ---------------------
## Test Options Used
## ---------------------
##
## p value based on 999 bootstrap replicates
## Adjusted p-values based on the adjustment in Hall and York (2001)
## p value rounded to: 6 digits
For both datasets the Hopkins statistic is close to 0, so the null hypothesis of uniformly distributed data is rejected: the data show a clustering tendency. The Dip and Silverman tests returned p-values rounded to 0, which indicates that their null hypotheses of unimodality should also be rejected for both datasets; the data can therefore be clustered.
The centroid-based (k-means) algorithm is one of the most popular algorithms in cluster analysis and is also used in vector quantization. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. The algorithm scales to large datasets and guarantees convergence; however, it can be applied only to continuous data.
For the k-means algorithm the number of clusters has to be defined beforehand, so, to make the most accurate choice, several methods are used below to determine k.
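The total within-cluster sum of squares ("wss", the elbow criterion) that fviz_nbclust plots below can also be computed directly; a minimal sketch, with an arbitrary nstart and k range:

set.seed(42)
wss <- sapply(1:10, function(k) kmeans(n.kepler, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")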
fviz_nbclust(n.kepler, FUNcluster = kmeans, method = "silhouette", k.max = 10, nboot = 100)
fviz_nbclust(n.kepler, FUNcluster = kmeans, method = "wss", k.max = 10, nboot = 100)
Optimal_Clusters_KMeans(n.kepler, max_clusters=10, plot_clusters = TRUE)
## [1] 1.0000000 0.9121733 0.8188333 0.7557514 0.6666948 0.6411550 0.5976459
## [8] 0.5752367 0.5594080 0.5449191
## attr(,"class")
## [1] "k-means clustering"
Optimal_Clusters_KMeans(n.kepler, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
## [1] 0.0000000 0.2414085 0.1399263 0.1347962 0.1162944 0.1340886 0.1206460
## [8] 0.1087086 0.1066141 0.1065465
## attr(,"class")
## [1] "k-means clustering"
The highest silhouette width is visible for 2 clusters. The other plots do not indicate a particular number clearly; they only suggest that the more clusters, the better, but a large number of clusters is not recommended. For further analysis 2, 3, 4 and 6 clusters were chosen.
Below, the results for the different numbers of clusters are compared.
kmeans_2 <- eclust(n.kepler, "kmeans", hc_metric="euclidean", k=2)
fviz_cluster(kmeans_2, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
kmeans_3 <- eclust(n.kepler, "kmeans", hc_metric="euclidean", k=3)
fviz_cluster(kmeans_3, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
kmeans_4 <- eclust(n.kepler, "kmeans", hc_metric="euclidean", k=4)
fviz_cluster(kmeans_4, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
kmeans_6 <- eclust(n.kepler, "kmeans", hc_metric="euclidean", k=6)
fviz_cluster(kmeans_6, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
Silhouette analysis can be used to study the separation between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters, and thus provides a visual way to assess parameters such as the number of clusters. For each observation the silhouette width is (b - a) / max(a, b), where a is the mean distance to the other points of its own cluster and b is the mean distance to the points of the nearest other cluster; it ranges from -1 to 1, and the higher the value, the better the observation fits its cluster.
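The same quantity that fviz_silhouette visualizes below can also be obtained numerically with cluster::silhouette; a short sketch for the two-cluster solution (the explicit distance matrix on roughly 8,000 points is large but still feasible):

library(cluster)

sil_2 <- silhouette(kmeans_2$cluster, dist(n.kepler))   # per-observation silhouette widths
summary(sil_2)$avg.width                                # overall average silhouette width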
fviz_silhouette(kmeans_2)
## cluster size ave.sil.width
## 1 1 6374 0.39
## 2 2 1620 -0.09
fviz_silhouette(kmeans_3)
## cluster size ave.sil.width
## 1 1 3289 -0.05
## 2 2 576 0.12
## 3 3 4129 0.33
fviz_silhouette(kmeans_4)
## cluster size ave.sil.width
## 1 1 566 0.10
## 2 2 566 0.11
## 3 3 4088 0.29
## 4 4 2774 0.03
fviz_silhouette(kmeans_6)
## cluster size ave.sil.width
## 1 1 3666 0.24
## 2 2 21 -0.03
## 3 3 2297 0.14
## 4 4 1007 0.03
## 5 5 500 0.13
## 6 6 503 0.13
The silhouette width reaches its highest value for 2 clusters, as predicted from the earlier plot. The overall value is 0.29, which is relatively low. The first cluster is massive, while the second one contains many observations with negative silhouette values. The other silhouette widths are even lower and the clustering quality is worse. Graphically, the best fit is achieved for 6 clusters, which has only a few values below 0.
To evaluate the quality of the clustering, two measures were chosen: the Calinski-Harabasz index, the ratio of between-cluster to within-cluster dispersion (the higher, the better the separation), and the Duda-Hart test, which tests the null hypothesis that the data should not be split into two clusters.
Calinski-Harabasz test:
round(calinhara(n.kepler, kmeans_2$cluster),digits=2)
## [1] 867.7
Duda-Hart test:
dudahart2(n.kepler, kmeans_2$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.9020619
##
## $compare
## [1] 0.948305
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
Calinski-Harabasz test:
round(calinhara(n.kepler, kmeans_3$cluster),digits=2)
## [1] 769.6
Duda-Hart test:
dudahart2(n.kepler, kmeans_3$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.6521784
##
## $compare
## [1] 0.948305
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
Calinski-Harabasz test:
round(calinhara(n.kepler, kmeans_4$cluster),digits=2)
## [1] 860.75
Duda-Hart test:
dudahart2(n.kepler, kmeans_4$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.2322315
##
## $compare
## [1] 0.948305
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
Calinski-Harabasz test:
round(calinhara(n.kepler, kmeans_6$cluster),digits=2)
## [1] 970.84
Duda-Hart test:
dudahart2(n.kepler, kmeans_6$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.2530384
##
## $compare
## [1] 0.948305
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
The Calinski-Harabasz index returns the highest value for 6 clusters, but again a tendency towards ever more clusters should be avoided. The p-value of the Duda-Hart test is rounded to 0, which indicates that the clusters should be split further.
Other features of the previously generated cluster solutions are checked below. The cluster size shows how many observations fall into each cluster, and the cluster centers are displayed for every variable.
kmeans_2$size
## [1] 6374 1620
kmeans_2$centers
## koi_score koi_period koi_time0bk koi_impact koi_duration koi_depth
## 1 0.2277246 0.05090485 0.05782602 -0.07942501 -0.07977698 -0.2321926
## 2 -0.8959977 -0.20028859 -0.22752040 0.31250310 0.31388796 0.9135775
## koi_prad koi_teq koi_insol koi_model_snr koi_steff koi_slogg
## 1 -0.06251983 -0.2474020 -0.04454498 -0.2348955 -0.1623106 0.2570398
## 2 0.24598850 0.9734199 0.17526524 0.9242122 0.6386222 -1.0113407
## koi_srad ra dec koi_kepmag
## 1 -0.1054412 -0.06731255 0.0368244 0.2052854
## 2 0.4148655 0.26484578 -0.1448881 -0.8077092
kmeans_3$size
## [1] 3289 576 4129
kmeans_3$centers
## koi_score koi_period koi_time0bk koi_impact koi_duration koi_depth
## 1 -0.8962839 0.1531647 0.07344351 0.2465609 0.1939113 -0.1608861
## 2 -0.9909086 -0.1332665 -0.13229453 0.1507168 0.3316375 2.9760454
## 3 0.8521776 -0.1034142 -0.04004699 -0.2174259 -0.2007259 -0.2870060
## koi_prad koi_teq koi_insol koi_model_snr koi_steff koi_slogg
## 1 0.03451850 0.4133696 0.06258261 -0.1284659 0.2948058 -0.3126399
## 2 0.36666875 0.1632098 -0.02685784 2.6908827 0.4566490 -0.1130649
## 3 -0.07864678 -0.3520420 -0.04610417 -0.2730501 -0.2985338 0.2648094
## koi_srad ra dec koi_kepmag
## 1 0.14419892 0.26281245 -0.14599939 -0.1548263
## 2 -0.04995128 0.09709507 -0.01545794 -0.1169405
## 3 -0.10789496 -0.22289099 0.11845381 0.1396419
kmeans_4$size
## [1] 566 566 4088 2774
kmeans_4$centers
## koi_score koi_period koi_time0bk koi_impact koi_duration koi_depth
## 1 -0.4824639 2.9881680 2.29558545 -0.01538958 1.62775758 -0.1791405
## 2 -0.9940845 -0.1956176 -0.26044958 0.14855305 0.31446363 2.9634507
## 3 0.8178903 -0.1714014 -0.09708649 -0.20940523 -0.21587554 -0.2834441
## 4 -0.9040408 -0.3171934 -0.27216919 0.28142685 -0.07815357 -0.1503965
## koi_prad koi_teq koi_insol koi_model_snr koi_steff koi_slogg
## 1 -0.03056573 -0.9100443 -0.04770831 -0.1624162 0.1320790 -0.1865300
## 2 0.36943889 0.1596011 -0.02718373 2.6988753 0.4489147 -0.1152114
## 3 -0.07792890 -0.3487283 -0.04623475 -0.2675338 -0.3042427 0.2691346
## 4 0.04569977 0.6670340 0.08341621 -0.1232724 0.3298132 -0.3350529
## koi_srad ra dec koi_kepmag
## 1 0.01743424 -0.1021614 0.05091131 -0.1161374
## 2 -0.04945405 0.1034862 -0.02285610 -0.1125274
## 3 -0.10807500 -0.2265458 0.11929185 0.1448955
## 4 0.16580166 0.3335867 -0.18152283 -0.1668740
kmeans_6$size
## [1] 3666 21 2297 1007 500 503
kmeans_6$centers
## koi_score koi_period koi_time0bk koi_impact koi_duration koi_depth
## 1 0.9986923 -0.16580372 -0.09371650 -0.22502758 -0.22224158 -0.29958714
## 2 -0.7192039 0.04883552 -0.04619534 1.29991088 -0.20531481 -0.23360722
## 3 -0.9788066 -0.27862083 -0.19454350 0.28926625 -0.09238015 -0.08715666
## 4 -0.6650187 -0.28858742 -0.30856810 0.15403408 0.05388176 -0.15599284
## 5 -0.9962382 -0.18905431 -0.23463221 0.01361026 0.35323595 3.10962783
## 6 -0.4572388 3.24440878 2.42434473 -0.05707565 1.59119190 -0.18755084
## koi_prad koi_teq koi_insol koi_model_snr koi_steff koi_slogg
## 1 -0.079999595 -0.34427737 -0.04618898 -0.27797351 -0.23398591 0.2461515
## 2 6.140066746 7.21549938 10.76903988 -0.30938461 -2.52542298 -8.4944042
## 3 -0.002218252 0.07721412 -0.04088840 -0.10688999 -0.06777443 0.3742486
## 4 0.080403631 1.32496961 0.07656940 -0.09803218 0.74920107 -1.4313671
## 5 0.217884643 0.14676540 -0.03192648 2.92917050 0.46527677 -0.1342484
## 6 -0.040708430 -0.94312644 -0.04779827 -0.18845443 0.15789309 -0.1493976
## koi_srad ra dec koi_kepmag
## 1 -0.10711396 -0.1893784 0.10840708 0.1118957
## 2 16.43718669 0.7367259 -0.19414232 -1.3842821
## 3 -0.12109929 0.1945341 -0.11077240 0.4547427
## 4 0.35760628 0.2353703 -0.16050033 -1.2627279
## 5 -0.04981071 0.1203382 -0.04405109 -0.1897004
## 6 -0.01896668 -0.1297055 0.08896663 -0.1178260
PAM is an implementation of the k-medoid method, a clustering technique that divides a dataset containing n objects into k clusters known in advance. The operation of this algorithm is similar to k-means, with the difference that the centroids are replaced by medoids, for which the distance from all other elements within a given group is minimal. This algorithm aims to minimize the sum of the distances of all non-medoid elements from the medoids closest to them.
In PAM the number of clusters has to be pre-determined, so some analysis has to be done beforehand. Below is a comparison of the average silhouette width and the total within-cluster sum of squares.
fviz_nbclust(n.kepler, FUNcluster = cluster::pam, method = "silhouette", k.max = 10, nboot = 100)
fviz_nbclust(n.kepler, FUNcluster = cluster::pam, method = "wss", k.max = 10, nboot = 100)
The silhouette width has the highest value for 4 clusters. The total within-cluster sum of squares, as for k-means, suggests that the more clusters, the better. For further analysis 3, 4 and 7 clusters were chosen.
Below, the clustering results are plotted for each chosen number of clusters.
pam_3 <- eclust(n.kepler, "pam", k=3, hc_metric="euclidean")
fviz_cluster(pam_3)
pam_4 <- eclust(n.kepler, "pam", k=4, hc_metric="euclidean")
fviz_cluster(pam_4)
pam_7 <- eclust(n.kepler, "pam", k=7, hc_metric="euclidean")
fviz_cluster(pam_7)
fviz_silhouette(pam_3)
## cluster size ave.sil.width
## 1 1 3926 0.34
## 2 2 3529 -0.03
## 3 3 539 0.11
fviz_silhouette(pam_4)
## cluster size ave.sil.width
## 1 1 3909 0.29
## 2 2 2869 0.05
## 3 3 692 0.07
## 4 4 524 0.11
fviz_silhouette(pam_7)
## cluster size ave.sil.width
## 1 1 1269 0.16
## 2 2 2003 0.10
## 3 3 943 -0.07
## 4 4 1117 0.12
## 5 5 647 0.04
## 6 6 501 0.10
## 7 7 1514 0.15
As predicted beforehand, the highest silhouette width is achieved for 4 clusters. The values for all numbers of clusters are relatively low, and the goodness of fit is worse than for k-means.
Calinski-Harabasz test:
round(calinhara(n.kepler, pam_3$cluster),digits=2)
## [1] 841.83
Duda-Hart test:
dudahart2(n.kepler, pam_3$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.7337832
##
## $compare
## [1] 0.948305
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
Calinski-Harabasz test:
round(calinhara(n.kepler, pam_4$cluster),digits=2)
## [1] 839.54
Duda-Hart test:
dudahart2(n.kepler, pam_4$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.5082218
##
## $compare
## [1] 0.948305
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
Calinski-Harabasz test:
round(calinhara(n.kepler, pam_7$cluster),digits=2)
## [1] 654.08
Duda-Hart test:
dudahart2(n.kepler, pam_7$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.1327225
##
## $compare
## [1] 0.948305
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
The p-value of the Duda-Hart test is again rounded to 0; in this case, however, the Calinski-Harabasz value decreases as the number of clusters increases.
Gower distance is a measure that can be applied to all types of variables: numerical, categorical, logical or character. The value always lies in the range from 0 to 1. Ordinal variables are handled with a ranked Manhattan distance, nominal variables are converted to binary indicators and compared with the Dice coefficient, and quantitative variables use a range-normalized Manhattan distance.
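The dissimilarity matrix gower_dist used below is not shown being computed; one way to obtain it is with cluster::daisy, sketched here under the assumption that the standardized mixed set was used (the object name matches the calls that follow).

library(cluster)

# Gower dissimilarity on the mixed-type dataset: numeric columns get a
# range-normalized Manhattan distance, factor columns are treated as nominal
gower_dist <- daisy(n.kepler_mix, metric = "gower")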
gower_2 <- pam(gower_dist, diss = TRUE, k = 2)
gower_3 <- pam(gower_dist, diss = TRUE, k = 3)
gower_4 <- pam(gower_dist, diss = TRUE, k = 4)
gower_5 <- pam(gower_dist, diss = TRUE, k = 5)
gower_6 <- pam(gower_dist, diss = TRUE, k = 6)
gower_7 <- pam(gower_dist, diss = TRUE, k = 7)
gower_8 <- pam(gower_dist, diss = TRUE, k = 8)
gower_9 <- pam(gower_dist, diss = TRUE, k = 9)
gower_10 <- pam(gower_dist, diss = TRUE, k = 10)
print(gower_2$silinfo$avg.width)
## [1] 0.5128724
print(gower_3$silinfo$avg.width)
## [1] 0.5245092
print(gower_4$silinfo$avg.width)
## [1] 0.5773557
print(gower_5$silinfo$avg.width)
## [1] 0.393436
print(gower_6$silinfo$avg.width)
## [1] 0.378501
print(gower_7$silinfo$avg.width)
## [1] 0.3837245
print(gower_8$silinfo$avg.width)
## [1] 0.3766602
print(gower_9$silinfo$avg.width)
## [1] 0.3296458
print(gower_10$silinfo$avg.width)
## [1] 0.2656934
The silhouette width reaches its highest value for 4 clusters, so this is the number chosen.
gower_4 <- pam(gower_dist, diss = TRUE, k=4)
fviz_silhouette(gower_4)
## cluster size ave.sil.width
## 1 1 4029 0.71
## 2 2 1762 0.51
## 3 3 1413 0.36
## 4 4 790 0.46
The obtained silhouette plot is the best of all the clustering methods. There are very few values below zero and the average width is about 0.58.
Calinski-Harabasz test:
round(calinhara(n.kepler_mix, gower_2$cluster),digits=2)
## [1] 1318.68
Duda-Hart test:
dudahart2(n.kepler_mix, gower_2$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.8583695
##
## $compare
## [1] 0.9623101
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
Calinski-Harabasz test:
round(calinhara(n.kepler_mix, gower_3$cluster),digits=2)
## [1] 1002.34
Duda-Hart test:
dudahart2(n.kepler_mix, gower_3$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.6536341
##
## $compare
## [1] 0.9623101
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
Calinski-Harabasz test:
round(calinhara(n.kepler_mix, gower_4$cluster),digits=2)
## [1] 927.77
Duda-Hart test:
dudahart2(n.kepler_mix, gower_4$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.4927167
##
## $compare
## [1] 0.9623101
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
The Calinski-Harabasz value decreases as the number of clusters grows, in contrast to the silhouette width, which peaks at 4 clusters.
Before conducting Principal Component Analysis, the data need to be tested. Two measures were chosen: Bartlett's test and the Kaiser-Meyer-Olkin (KMO) measure. Bartlett's test of sphericity compares the observed correlation matrix to the identity matrix. KMO measures sampling adequacy for each variable in the model and for the complete model.
For continuous data:
cor_matrix <- cor(n.kepler)
cortest.bartlett(cor_matrix, n = 7994)
## $chisq
## [1] 34167.57
##
## $p.value
## [1] 0
##
## $df
## [1] 120
KMO(cor_matrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_matrix)
## Overall MSA = 0.58
## MSA for each item =
## koi_score koi_period koi_time0bk koi_impact koi_duration
## 0.53 0.53 0.64 0.48 0.59
## koi_depth koi_prad koi_teq koi_insol koi_model_snr
## 0.57 0.52 0.60 0.58 0.55
## koi_steff koi_slogg koi_srad ra dec
## 0.57 0.61 0.66 0.67 0.69
## koi_kepmag
## 0.63
For mixed data:
cor_matrix_mix <- cor(n.kepler_mix)
cortest.bartlett(cor_matrix_mix, n = 7994)
## $chisq
## [1] 92650.52
##
## $p.value
## [1] 0
##
## $df
## [1] 253
KMO(cor_matrix_mix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_matrix_mix)
## Overall MSA = 0.71
## MSA for each item =
## koi_disposition koi_pdisposition koi_score koi_fpflag_nt
## 0.87 0.73 0.78 0.52
## koi_fpflag_ss koi_fpflag_co koi_fpflag_ec koi_period
## 0.75 0.77 0.87 0.59
## koi_time0bk koi_impact koi_duration koi_depth
## 0.67 0.59 0.65 0.74
## koi_prad koi_teq koi_insol koi_model_snr
## 0.54 0.68 0.59 0.75
## koi_tce_plnt_num koi_steff koi_slogg koi_srad
## 0.88 0.63 0.62 0.66
## ra dec koi_kepmag
## 0.84 0.81 0.64
Checking for visible correlations between variables:
kepler.cor <- cor(kepler, method = "pearson")
corrplot(kepler.cor, order = "alphabet", tl.cex = 0.6)
The plot indicates several positive and a few negative correlations. Based on the test results and the correlation plot, the data can be considered suitable for dimension reduction.
Principal Component Analysis is one of the statistical methods of factor analysis. A dataset consisting of N observations, each containing K variables, can be interpreted as a cloud of N points in a K-dimensional space. The goal of PCA is to rotate the coordinate system so as to maximize first the variance of the first coordinate, then the variance of the second coordinate, and so on. The transformed coordinate values are the observations' scores on the generated factors (principal components). In this way a new observation space is constructed in which the first few factors explain most of the variability.
pca <- prcomp(n.kepler, center=FALSE, scale.=FALSE)
get_eigenvalue(pca)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.8711543 17.944714 17.94471
## Dim.2 2.0386946 12.741841 30.68656
## Dim.3 1.8693421 11.683388 42.36994
## Dim.4 1.4814177 9.258860 51.62880
## Dim.5 1.2606210 7.878881 59.50769
## Dim.6 1.0875519 6.797200 66.30489
## Dim.7 0.9852795 6.157997 72.46288
## Dim.8 0.8133554 5.083471 77.54635
## Dim.9 0.7687834 4.804896 82.35125
## Dim.10 0.6656214 4.160134 86.51138
## Dim.11 0.5456875 3.410547 89.92193
## Dim.12 0.4173867 2.608667 92.53060
## Dim.13 0.3796646 2.372904 94.90350
## Dim.14 0.3647637 2.279773 97.18327
## Dim.15 0.2755864 1.722415 98.90569
## Dim.16 0.1750898 1.094311 100.00000
fviz_eig(pca, addlabels=TRUE)
One criterion for choosing the number of dimensions to keep is the eigenvalue (greater than 1); the values above indicate that 6 dimensions should be chosen. The other criterion is the percentage of variance explained, which should be at least 5%; on the plot 8 dimensions meet that threshold, but the 8th dimension has an eigenvalue of only about 0.81. For further analysis 7 components, explaining 72.5% of the variance, were selected.
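If the first seven components are kept, their observation scores can be extracted from the prcomp object for any downstream work; a minimal sketch:

pca_scores <- as.data.frame(pca$x[, 1:7])   # scores of each observation on the first 7 components
head(pca_scores)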
fviz_pca_var(pca, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "purple", "red"), repel = TRUE)
fviz_contrib(pca, choice = "var", axes = 1, top = 10)
fviz_contrib(pca, choice = "var", axes = 2, top = 10)
fviz_contrib(pca, choice = "var", axes = 3, top = 10)
fviz_contrib(pca, choice = "var", axes = 4, top = 10)
fviz_contrib(pca, choice = "var", axes = 5, top = 10)
fviz_contrib(pca, choice = "var", axes = 6, top = 10)
fviz_contrib(pca, choice = "var", axes = 7, top = 10)
Multidimensional scaling is a statistical technique aimed at detecting hidden variables that, although not directly observed, explain the similarities and differences between the studied objects.
There are many competing scaling techniques. The input to the procedure is usually a matrix of distances or similarities between the objects; it may be, for example, a correlation matrix. Multidimensional scaling arranges the objects as points in n-dimensional space so that similar objects lie closer together. The result of the analysis is, for each object, n real numbers that can be interpreted as Cartesian coordinates.
dist.kepler <- dist(n.kepler)
mds <- cmdscale(dist.kepler, k=2)
summary(mds)
## V1 V2
## Min. :-3.2466 Min. :-14.8256
## 1st Qu.:-0.9089 1st Qu.: -0.4039
## Median :-0.2801 Median : 0.4027
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4809 3rd Qu.: 0.8566
## Max. :41.6389 Max. : 18.1694
plot(mds)
The plot presents the observations reduced to two dimensions. Clear outliers are visible, and they may affect the results. If the analytic tools are sensitive to outliers, further analysis should be performed without those points to increase its credibility.
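One simple way of doing that, sketched below, is to flag the points lying farthest from the origin of the MDS plane and recompute the embedding without them; the 99th-percentile cut-off is an arbitrary choice.

mds_radius <- sqrt(rowSums(mds^2))                       # distance of each point from the MDS origin
keep <- mds_radius < quantile(mds_radius, 0.99)          # drop the most extreme 1% of points
mds_trimmed <- cmdscale(dist(n.kepler[keep, ]), k = 2)   # classical MDS on the trimmed data
plot(mds_trimmed)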
Principal Component Analysis of Mixed Data (PCAmix) performs principal component analysis of a set of individuals (observations) described by a mixture of qualitative and quantitative variables. PCAmix includes ordinary principal component analysis (PCA) and multiple correspondence analysis (MCA) as special cases.
quali <- kepler_mix[, c(1, 2)]
quanti <- kepler
pca_mix <- PCAmix(X.quanti = quanti, X.quali = quali, ndim = 18, rename.level=TRUE, graph = FALSE)
pca_mix$eig
## Eigenvalue Proportion Cumulative
## dim 1 3.799400927 19.99684698 19.99685
## dim 2 2.481059712 13.05820901 33.05506
## dim 3 1.915053214 10.07922744 43.13428
## dim 4 1.484984540 7.81570811 50.94999
## dim 5 1.395797798 7.34630420 58.29630
## dim 6 1.260135492 6.63229206 64.92859
## dim 7 1.052730922 5.54068906 70.46928
## dim 8 0.984608610 5.18215058 75.65143
## dim 9 0.911799096 4.79894261 80.45037
## dim 10 0.791927709 4.16804057 84.61841
## dim 11 0.695691047 3.66153183 88.27994
## dim 12 0.550161241 2.89558548 91.17553
## dim 13 0.426418371 2.24430722 93.41984
## dim 14 0.378677759 1.99304084 95.41288
## dim 15 0.363847457 1.91498662 97.32786
## dim 16 0.281766117 1.48297956 98.81084
## dim 17 0.180412837 0.94954125 99.76038
## dim 18 0.035931957 0.18911556 99.94950
## dim 19 0.009595193 0.05050102 100.00000
In order to perform PCA on the mixed data, the dataset had to be divided into two subsets: one containing the qualitative variables and one containing only the quantitative ones. The obtained eigenvalues indicate that the number of dimensions should be reduced to 7.
plot(pca_mix, choice="sqload",main="Squared correlations")
plot(pca_mix, choice="cor",main="Correlation circle")
Factor analysis of mixed data (FAMD) is another principal component method dedicated to analyzing a dataset containing both quantitative and qualitative variables. It makes it possible to analyze the similarity between individuals while taking mixed types of variables into account. It is a mixture of PCA and MCA: it uses PCA for the quantitative variables and MCA for the qualitative ones.
famd <- FAMD(kepler_mix, ncp=23, graph=FALSE)
get_eigenvalue(famd)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.69275003 19.5531251 19.55313
## Dim.2 2.54753602 10.6147334 30.16786
## Dim.3 2.20675371 9.1948071 39.36267
## Dim.4 2.06657965 8.6107485 47.97341
## Dim.5 1.48899861 6.2041609 54.17758
## Dim.6 1.31512006 5.4796669 59.65724
## Dim.7 1.13605632 4.7335680 64.39081
## Dim.8 0.98960099 4.1233375 68.51415
## Dim.9 0.95856903 3.9940376 72.50819
## Dim.10 0.89498788 3.7291162 76.23730
## Dim.11 0.84396658 3.5165274 79.75383
## Dim.12 0.78099437 3.2541432 83.00797
## Dim.13 0.69039399 2.8766416 85.88461
## Dim.14 0.63917734 2.6632389 88.54785
## Dim.15 0.53479002 2.2282917 90.77614
## Dim.16 0.43074457 1.7947690 92.57091
## Dim.17 0.40137921 1.6724134 94.24333
## Dim.18 0.37493036 1.5622098 95.80554
## Dim.19 0.34684451 1.4451854 97.25072
## Dim.20 0.27389069 1.1412112 98.39193
## Dim.21 0.17700331 0.7375138 99.12945
## Dim.22 0.16443456 0.6851440 99.81459
## Dim.23 0.03515212 0.1464672 99.96106
fviz_eig(famd, ncp=10, addlabels=TRUE)
fviz_famd_var(famd, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "purple", "red"), repel = TRUE)
According to the eigenvalues, the number of dimensions should be reduced to 7; however, the 7th dimension does not explain more than 5% of the variance. For further analysis 6 dimensions were chosen; together they explain almost 60% of the variance, which is acceptable.
fviz_contrib(famd, choice = "var", axes = 1, top = 10)
fviz_contrib(famd, choice = "var", axes = 2, top = 10)
fviz_contrib(famd, choice = "var", axes = 3, top = 10)
fviz_contrib(famd, choice = "var", axes = 4, top = 10)
fviz_contrib(famd, choice = "var", axes = 5, top = 10)
fviz_contrib(famd, choice = "var", axes = 6, top = 10)
To conclude, this paper aimed to present several methods for clustering and dimension reduction of a dataset with mixed data types. K-means and PAM were chosen for the continuous variables; however, they did not deliver acceptable results. The best cluster quality was achieved by PAM clustering on the Gower distance computed for the mixed dataset. For dimension reduction, PCA and MDS were chosen for the continuous variables, and FAMD and PCAmix for the mixed variables. The overall conclusion is that, in order to achieve the best significance, robustness and quality of analysis, several methods need to be compared and tested.