Kepler’s Exoplanets

Aleksandra Tomczak

1st March 2021

NASA’s Kepler Space Telescope was a space observatory dedicated to finding planetary systems that may resemble ours. It collected data for nine years, from 2009 to 2018. During that time the telescope observed a large number of stars and discovered thousands of extrasolar planets (exoplanets) and planet candidates.

The analyzed dataset was published as-is by NASA. It is a cumulative record of all objects of interest observed by Kepler (up to 2016), both confirmed exoplanets and candidates. To learn more about the other variables and the topic itself, please visit the Caltech site or see the original table here.

## 'data.frame':    9564 obs. of  50 variables:
##  $ rowid            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ kepid            : int  10797460 10797460 10811496 10848459 10854555 10872983 10872983 10872983 6721123 10910878 ...
##  $ kepoi_name       : Factor w/ 9564 levels "K00001.01","K00002.01",..: 1081 1082 1083 1084 1085 1086 1087 1088 108 1089 ...
##  $ kepler_name      : Factor w/ 2295 levels "","Kepler-1 b",..: 1037 1038 1 1 1869 1041 1040 1039 1 1043 ...
##  $ koi_disposition  : Factor w/ 3 levels "CANDIDATE","CONFIRMED",..: 2 2 3 3 2 2 2 2 3 2 ...
##  $ koi_pdisposition : Factor w/ 2 levels "CANDIDATE","FALSE POSITIVE": 1 1 2 2 1 1 1 1 2 1 ...
##  $ koi_score        : num  1 0.969 0 0 1 1 1 0.992 0 1 ...
##  $ koi_fpflag_nt    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ koi_fpflag_ss    : int  0 0 1 1 0 0 0 0 1 0 ...
##  $ koi_fpflag_co    : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ koi_fpflag_ec    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ koi_period       : num  9.49 54.42 19.9 1.74 2.53 ...
##  $ koi_period_err1  : num  2.78e-05 2.48e-04 1.49e-05 2.63e-07 3.76e-06 ...
##  $ koi_period_err2  : num  -2.78e-05 -2.48e-04 -1.49e-05 -2.63e-07 -3.76e-06 ...
##  $ koi_time0bk      : num  171 163 176 170 172 ...
##  $ koi_time0bk_err1 : num  0.00216 0.00352 0.000581 0.000115 0.00113 0.00141 0.0019 0.00461 0.00253 0.000517 ...
##  $ koi_time0bk_err2 : num  -0.00216 -0.00352 -0.000581 -0.000115 -0.00113 -0.00141 -0.0019 -0.00461 -0.00253 -0.000517 ...
##  $ koi_impact       : num  0.146 0.586 0.969 1.276 0.701 ...
##  $ koi_impact_err1  : num  0.318 0.059 5.126 0.115 0.235 ...
##  $ koi_impact_err2  : num  -0.146 -0.443 -0.077 -0.092 -0.478 -0.428 -0.532 -0.523 -0.044 -0.052 ...
##  $ koi_duration     : num  2.96 4.51 1.78 2.41 1.65 ...
##  $ koi_duration_err1: num  0.0819 0.116 0.0341 0.00537 0.042 0.061 0.0673 0.165 0.136 0.0241 ...
##  $ koi_duration_err2: num  -0.0819 -0.116 -0.0341 -0.00537 -0.042 -0.061 -0.0673 -0.165 -0.136 -0.0241 ...
##  $ koi_depth        : num  616 875 10829 8079 603 ...
##  $ koi_depth_err1   : num  19.5 35.5 171 12.8 16.9 24.2 18.7 16.8 5.8 33.3 ...
##  $ koi_depth_err2   : num  -19.5 -35.5 -171 -12.8 -16.9 -24.2 -18.7 -16.8 -5.8 -33.3 ...
##  $ koi_prad         : num  2.26 2.83 14.6 33.46 2.75 ...
##  $ koi_prad_err1    : num  0.26 0.32 3.92 8.5 0.88 1.27 0.9 0.52 6.45 0.22 ...
##  $ koi_prad_err2    : num  -0.15 -0.19 -1.31 -2.83 -0.35 -0.42 -0.3 -0.17 -9.67 -0.49 ...
##  $ koi_teq          : num  793 443 638 1395 1406 ...
##  $ koi_teq_err1     : logi  NA NA NA NA NA NA ...
##  $ koi_teq_err2     : logi  NA NA NA NA NA NA ...
##  $ koi_insol        : num  93.59 9.11 39.3 891.96 926.16 ...
##  $ koi_insol_err1   : num  29.45 2.87 31.04 668.95 874.33 ...
##  $ koi_insol_err2   : num  -16.65 -1.62 -10.49 -230.35 -314.24 ...
##  $ koi_model_snr    : num  35.8 25.8 76.3 505.6 40.9 ...
##  $ koi_tce_plnt_num : int  1 2 1 1 1 1 2 3 1 1 ...
##  $ koi_tce_delivname: Factor w/ 4 levels "","q1_q16_tce",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ koi_steff        : num  5455 5455 5853 5805 6031 ...
##  $ koi_steff_err1   : num  81 81 158 157 169 189 189 189 111 75 ...
##  $ koi_steff_err2   : num  -81 -81 -176 -174 -211 -232 -232 -232 -124 -83 ...
##  $ koi_slogg        : num  4.47 4.47 4.54 4.56 4.44 ...
##  $ koi_slogg_err1   : num  0.064 0.064 0.044 0.053 0.07 0.054 0.054 0.054 0.182 0.083 ...
##  $ koi_slogg_err2   : num  -0.096 -0.096 -0.176 -0.168 -0.21 -0.229 -0.229 -0.229 -0.098 -0.028 ...
##  $ koi_srad         : num  0.927 0.927 0.868 0.791 1.046 ...
##  $ koi_srad_err1    : num  0.105 0.105 0.233 0.201 0.334 0.315 0.315 0.315 0.322 0.033 ...
##  $ koi_srad_err2    : num  -0.061 -0.061 -0.078 -0.067 -0.133 -0.105 -0.105 -0.105 -0.483 -0.072 ...
##  $ ra               : num  292 292 297 286 289 ...
##  $ dec              : num  48.1 48.1 48.1 48.3 48.2 ...
##  $ koi_kepmag       : num  15.3 15.3 15.4 15.6 15.5 ...

As is clearly visible, the data are of mixed type; therefore, the analysis is divided into two parts, one for the continuous variables and one for the mixed set.

1. Data Preparation

The original dataframe consists of 50 variables and almost 10 000 observed objects (KOIs, Kepler Objects of Interest). The dataset has an extensive data dictionary, which can be accessed here. Notable columns include:

  • koi_disposition: The disposition in the literature towards this exoplanet candidate. One of CANDIDATE, FALSE POSITIVE, NOT DISPOSITIONED or CONFIRMED.
  • koi_pdisposition: The disposition Kepler data analysis has towards this exoplanet candidate. One of FALSE POSITIVE, NOT DISPOSITIONED, and CANDIDATE.
  • koi_score: A value between 0 and 1 that indicates the confidence in the KOI disposition. For CANDIDATEs, a higher value indicates more confidence in its disposition, while for FALSE POSITIVEs, a higher value indicates less confidence in that disposition.
  • koi_period: The interval between consecutive planetary transits.
  • koi_impact: The sky-projected distance between the center of the stellar disc and the center of the planet disc at conjunction, normalized by the stellar radius.
  • koi_duration: The duration of the observed transits. Duration is measured from first contact between the planet and star until last contact.
  • koi_prad: The radius of the planet. Planetary radius is the product of the planet star radius ratio and the stellar radius.
  • koi_teq: Approximation for the temperature of the planet.

Besides the abovementioned variables, the dataset contains measurement-error columns that are not needed for the cluster analysis. Those columns were removed together with the ID columns, and rows containing missing values were dropped, since for planetary features it is not reasonable to impute them with summary statistics. The reduced dataset contains 23 variables and almost 8 000 observations; a sketch of this step is given below.
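
A minimal sketch of the preparation described above is shown here; the raw-data object name (kepler_raw) and the exact column selection are assumptions inferred from the variable list, not the author’s original code.

drop_cols  <- c("rowid", "kepid", "kepoi_name", "kepler_name", "koi_tce_delivname")
kepler_mix <- kepler_raw[, !(names(kepler_raw) %in% drop_cols)]   # drop identifier-like columns (assumed)
kepler_mix <- kepler_mix[, !grepl("_err", names(kepler_mix))]     # drop measurement-error columns
kepler_mix <- na.omit(kepler_mix)                                 # remove rows with missing values
kepler     <- kepler_mix[, sapply(kepler_mix, is.double)]         # continuous-only subset (16 variables)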

Before further analysis both mixed and continuous datasets were standardized.
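
The standardization call itself is not shown, but it can be reconstructed from the warning messages below; a sketch using clusterSim might look as follows (type "n1" is classical column-wise standardization, (x - mean) / sd; the non-numeric columns of the mixed set are what trigger the warnings).

library(clusterSim)
n.kepler     <- data.Normalization(kepler,     type = "n1", normalization = "column")   # continuous set
n.kepler_mix <- data.Normalization(kepler_mix, type = "n1", normalization = "column")   # mixed set (warns on factor columns)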

## Warning in data.Normalization(kepler_mix, type = "n1", normalization =
## "column"): Data not numeric, normalization not applicable

## Warning in data.Normalization(kepler_mix, type = "n1", normalization =
## "column"): Data not numeric, normalization not applicable

2. Prediagnostics I

To determine whether the dataset is clusterable, three tests were performed on both the continuous and the mixed data. The Hopkins statistic measures the probability that the data were generated by a uniform distribution. The second measure was the Dip test, which compares the empirical distribution to the closest unimodal distribution and rejects the assumption of unimodality if the data differ sufficiently from it. The last test was Silverman's test, which is based on kernel density estimation and asks how wide a Gaussian kernel bandwidth is needed before the density estimate of the data becomes unimodal.

hopkins(n.kepler, n=100)
## $H
## [1] 0.0217199
clusterabilitytest(n.kepler, "dip",
                   reduction = "none",
                   distance_metric = "euclidean",
                   distance_standardize = "none",
                   is_dist_matrix = FALSE,
                   completecase = FALSE,
                   d_simulatepvalue = FALSE)
## n = 127904 > max_n{n in table} = 72000 -- using that as asymptotic value.
## ----------------------
## Clusterability Test
## ----------------------
## 
## Data set name: n.kepler
## Your data set has 7994 observation(s) and 16 variable(s).
## There were no missing values. Your data set is complete.
## 
## 
## -----------------------------------------
## Results: Dip Test of Unimodality
## -----------------------------------------
## 
## Null Hypothesis: number of modes = 1
## Alternative Hypothesis: number of modes > 1
## p-value: 0 
## Dip statistic: 0.0317236193826746 
## 
## ---------------------
## Test Options Used
## ---------------------
## 
## Default values for the optional parameters were used. To learn more about customizing the behavior of the clusterabilitytest, please see the R documentation.
clusterabilitytest(n.kepler, "silverman",
                   reduction = "none",
                   distance_metric = "euclidean",
                   distance_standardize = "none",
                   is_dist_matrix = FALSE,
                   completecase = FALSE,
                   d_simulatepvalue = FALSE,
                   s_m = 999, s_adjust = TRUE)
## ----------------------
## Clusterability Test
## ----------------------
## 
## Data set name: n.kepler
## Your data set has 7994 observation(s) and 16 variable(s).
## There were no missing values. Your data set is complete.
## 
## 
## ----------------------------------------------
## Results: Silverman's Critical Bandwidth Test
## ----------------------------------------------
## 
## Null Hypothesis: number of modes <= 1 
## Alternative Hypothesis: number of modes > 1 
## p-value: 0 
## Critical bandwidth: 11.060238 
## 
## ---------------------
## Test Options Used
## ---------------------
## 
## p value based on 999 bootstrap replicates
## Adjusted p-values based on the adjustment in Hall and York (2001)
## p value rounded to: 6 digits
hopkins(n.kepler_mix, n=100)
## $H
## [1] 0.02603993
clusterabilitytest(n.kepler_mix, "dip",
                   reduction = "none",
                   distance_metric = "euclidean",
                   distance_standardize = "none",
                   is_dist_matrix = FALSE,
                   completecase = FALSE,
                   d_simulatepvalue = FALSE)
## n = 183862 > max_n{n in table} = 72000 -- using that as asymptotic value.
## ----------------------
## Clusterability Test
## ----------------------
## 
## Data set name: n.kepler_mix
## Your data set has 7994 observation(s) and 23 variable(s).
## There were no missing values. Your data set is complete.
## 
## 
## -----------------------------------------
## Results: Dip Test of Unimodality
## -----------------------------------------
## 
## Null Hypothesis: number of modes = 1
## Alternative Hypothesis: number of modes > 1
## p-value: 0 
## Dip statistic: 0.0246940455077549 
## 
## ---------------------
## Test Options Used
## ---------------------
## 
## Default values for the optional parameters were used. To learn more about customizing the behavior of the clusterabilitytest, please see the R documentation.
clusterabilitytest(n.kepler_mix, "silverman",
                   reduction = "none",
                   distance_metric = "euclidean",
                   distance_standardize = "none",
                   is_dist_matrix = FALSE,
                   completecase = FALSE,
                   d_simulatepvalue = FALSE,
                   s_m = 999, s_adjust = TRUE)
## ----------------------
## Clusterability Test
## ----------------------
## 
## Data set name: n.kepler_mix
## Your data set has 7994 observation(s) and 23 variable(s).
## There were no missing values. Your data set is complete.
## 
## 
## ----------------------------------------------
## Results: Silverman's Critical Bandwidth Test
## ----------------------------------------------
## 
## Null Hypothesis: number of modes <= 1 
## Alternative Hypothesis: number of modes > 1 
## p-value: 0 
## Critical bandwidth: 11.030802 
## 
## ---------------------
## Test Options Used
## ---------------------
## 
## p value based on 999 bootstrap replicates
## Adjusted p-values based on the adjustment in Hall and York (2001)
## p value rounded to: 6 digits

For both datasets the Hopkins statistic is close to 0, well below the value of roughly 0.5 expected for uniformly distributed data under the convention of this implementation, so the hypothesis of spatial uniformity is rejected and the data show a clustering tendency. The Silverman and Dip tests both returned p-values rounded to 0, so the null hypothesis of unimodality is rejected as well; both datasets can be clustered.

3. Clustering

3.1. Continuous Variables

3.1.1. K-Means

The centroid (k-means) algorithm is one of the most popular algorithms in cluster analysis and is also used in vector quantization. It aims to partition n observations into k clusters such that each observation belongs to the cluster with the nearest mean. The algorithm scales to large datasets and is guaranteed to converge; however, it can be applied only to continuous data.
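
As a compact statement of the optimization problem (added here for reference), k-means looks for a partition C_1, …, C_k and centroids μ_j minimizing the within-cluster sum of squares:

$$\min_{C_1,\dots,C_k}\ \sum_{j=1}^{k}\sum_{x_i\in C_j}\lVert x_i-\mu_j\rVert^{2},\qquad \mu_j=\frac{1}{\lvert C_j\rvert}\sum_{x_i\in C_j}x_i$$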

3.1.1.1. Optimal Number of Clusters

For the k-means algorithm the number of clusters has to be defined beforehand, so several methods are used to determine the most appropriate k.

Silhouette I
fviz_nbclust(n.kepler, FUNcluster = kmeans, method = "silhouette", k.max = 10, nboot = 100)

Sum of Squares
fviz_nbclust(n.kepler, FUNcluster = kmeans, method = "wss", k.max = 10, nboot = 100)

Elbow
Optimal_Clusters_KMeans(n.kepler, max_clusters=10, plot_clusters = TRUE)

##  [1] 1.0000000 0.9121733 0.8188333 0.7557514 0.6666948 0.6411550 0.5976459
##  [8] 0.5752367 0.5594080 0.5449191
## attr(,"class")
## [1] "k-means clustering"
Silhouette II
Optimal_Clusters_KMeans(n.kepler, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")

##  [1] 0.0000000 0.2414085 0.1399263 0.1347962 0.1162944 0.1340886 0.1206460
##  [8] 0.1087086 0.1066141 0.1065465
## attr(,"class")
## [1] "k-means clustering"

The highest silhouette width is visible for 2 clusters. The other plots do not point clearly to a particular number, only that adding clusters keeps improving the criterion; however, a large number of clusters is not recommended. For the further analysis 2, 3, 4 and 6 clusters were chosen.

3.1.1.2. Results

Below, the results for the different numbers of clusters are compared.

2 Clusters
kmeans_2 <- eclust(n.kepler, "kmeans", hc_metric="euclidean", k=2)

fviz_cluster(kmeans_2, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

3 Clusters
kmeans_3 <- eclust(n.kepler, "kmeans", hc_metric="euclidean", k=3)

fviz_cluster(kmeans_3, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

4 Clusters
kmeans_4 <- eclust(n.kepler, "kmeans", hc_metric="euclidean", k=4)

fviz_cluster(kmeans_4, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

6 Clusters
kmeans_6 <- eclust(n.kepler, "kmeans", hc_metric="euclidean", k=6)

fviz_cluster(kmeans_6, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

3.1.1.3. Average Silhouette Width

Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighbouring clusters and thus provides a way to assess parameters such as the number of clusters visually. It ranges from -1 to 1, and the higher the value, the better the fit of the data to the cluster.
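
For reference, the silhouette of an observation i is defined as

$$s(i)=\frac{b(i)-a(i)}{\max\{a(i),\,b(i)\}},$$

where a(i) is the mean distance from i to the other members of its own cluster and b(i) is the mean distance from i to the members of the nearest other cluster.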

2 Clusters
fviz_silhouette(kmeans_2)
##   cluster size ave.sil.width
## 1       1 6374          0.39
## 2       2 1620         -0.09

3 Clusters
fviz_silhouette(kmeans_3)
##   cluster size ave.sil.width
## 1       1 3289         -0.05
## 2       2  576          0.12
## 3       3 4129          0.33

4 Clusters
fviz_silhouette(kmeans_4)
##   cluster size ave.sil.width
## 1       1  566          0.10
## 2       2  566          0.11
## 3       3 4088          0.29
## 4       4 2774          0.03

6 Clusters
fviz_silhouette(kmeans_6)
##   cluster size ave.sil.width
## 1       1 3666          0.24
## 2       2   21         -0.03
## 3       3 2297          0.14
## 4       4 1007          0.03
## 5       5  500          0.13
## 6       6  503          0.13

Silhouette width achieves the highest value for 2 clusters, as predicted from the earlier plot. The value is about 0.29, which is relatively low. The first cluster is massive, while the second one contains many observations with negative silhouette values. The other silhouette widths are even lower and the clustering quality is worse. Graphically, the best fit is achieved for 6 clusters, since few values fall below 0.

3.1.1.4. Quality Measures

2 Clusters

To evaluate the quality of the clustering, two measures were chosen: the Calinski-Harabasz index, which compares between-cluster to within-cluster dispersion (higher values indicate better-separated clusters), and the Duda-Hart test, whose null hypothesis is that the data should not be split into two clusters.

Calinski-Harabasz test:

round(calinhara(n.kepler, kmeans_2$cluster),digits=2)
## [1] 867.7

Duda-Hart test:

dudahart2(n.kepler, kmeans_2$cluster)
## $p.value
## [1] 0
## 
## $dh
## [1] 0.9020619
## 
## $compare
## [1] 0.948305
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232
3 Clusters

Calinski-Harabasz test:

round(calinhara(n.kepler, kmeans_3$cluster),digits=2)
## [1] 769.6

Duda-Hart test:

dudahart2(n.kepler, kmeans_3$cluster)
## $p.value
## [1] 0
## 
## $dh
## [1] 0.6521784
## 
## $compare
## [1] 0.948305
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232
4 Clusters

Calinski-Harabasz test:

round(calinhara(n.kepler, kmeans_4$cluster),digits=2)
## [1] 860.75

Duda-Hart test:

dudahart2(n.kepler, kmeans_4$cluster)
## $p.value
## [1] 0
## 
## $dh
## [1] 0.2322315
## 
## $compare
## [1] 0.948305
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232
6 Clusters

Calinski-Harabasz test:

round(calinhara(n.kepler, kmeans_6$cluster),digits=2)
## [1] 970.84

Duda-Hart test:

dudahart2(n.kepler, kmeans_6$cluster)
## $p.value
## [1] 0
## 
## $dh
## [1] 0.2530384
## 
## $compare
## [1] 0.948305
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232

The Calinski-Harabasz index returns the highest value for 6 clusters, but again the tendency towards ever finer splits is something that should be avoided. The p-value of the Duda-Hart test is also rounded to 0, which indicates that the clusters should be split further.

3.1.1.5. Other Postdiagnostics

Below, other features of the previously generated clusterings are inspected. The cluster sizes show how many observations fall into each cluster, and the cluster centers are displayed for every variable.

2 Clusters
kmeans_2$size
## [1] 6374 1620
kmeans_2$centers
##    koi_score  koi_period koi_time0bk  koi_impact koi_duration  koi_depth
## 1  0.2277246  0.05090485  0.05782602 -0.07942501  -0.07977698 -0.2321926
## 2 -0.8959977 -0.20028859 -0.22752040  0.31250310   0.31388796  0.9135775
##      koi_prad    koi_teq   koi_insol koi_model_snr  koi_steff  koi_slogg
## 1 -0.06251983 -0.2474020 -0.04454498    -0.2348955 -0.1623106  0.2570398
## 2  0.24598850  0.9734199  0.17526524     0.9242122  0.6386222 -1.0113407
##     koi_srad          ra        dec koi_kepmag
## 1 -0.1054412 -0.06731255  0.0368244  0.2052854
## 2  0.4148655  0.26484578 -0.1448881 -0.8077092
3 Clusters
kmeans_3$size
## [1] 3289  576 4129
kmeans_3$centers
##    koi_score koi_period koi_time0bk koi_impact koi_duration  koi_depth
## 1 -0.8962839  0.1531647  0.07344351  0.2465609    0.1939113 -0.1608861
## 2 -0.9909086 -0.1332665 -0.13229453  0.1507168    0.3316375  2.9760454
## 3  0.8521776 -0.1034142 -0.04004699 -0.2174259   -0.2007259 -0.2870060
##      koi_prad    koi_teq   koi_insol koi_model_snr  koi_steff  koi_slogg
## 1  0.03451850  0.4133696  0.06258261    -0.1284659  0.2948058 -0.3126399
## 2  0.36666875  0.1632098 -0.02685784     2.6908827  0.4566490 -0.1130649
## 3 -0.07864678 -0.3520420 -0.04610417    -0.2730501 -0.2985338  0.2648094
##      koi_srad          ra         dec koi_kepmag
## 1  0.14419892  0.26281245 -0.14599939 -0.1548263
## 2 -0.04995128  0.09709507 -0.01545794 -0.1169405
## 3 -0.10789496 -0.22289099  0.11845381  0.1396419
4 Clusters
kmeans_4$size
## [1]  566  566 4088 2774
kmeans_4$centers
##    koi_score koi_period koi_time0bk  koi_impact koi_duration  koi_depth
## 1 -0.4824639  2.9881680  2.29558545 -0.01538958   1.62775758 -0.1791405
## 2 -0.9940845 -0.1956176 -0.26044958  0.14855305   0.31446363  2.9634507
## 3  0.8178903 -0.1714014 -0.09708649 -0.20940523  -0.21587554 -0.2834441
## 4 -0.9040408 -0.3171934 -0.27216919  0.28142685  -0.07815357 -0.1503965
##      koi_prad    koi_teq   koi_insol koi_model_snr  koi_steff  koi_slogg
## 1 -0.03056573 -0.9100443 -0.04770831    -0.1624162  0.1320790 -0.1865300
## 2  0.36943889  0.1596011 -0.02718373     2.6988753  0.4489147 -0.1152114
## 3 -0.07792890 -0.3487283 -0.04623475    -0.2675338 -0.3042427  0.2691346
## 4  0.04569977  0.6670340  0.08341621    -0.1232724  0.3298132 -0.3350529
##      koi_srad         ra         dec koi_kepmag
## 1  0.01743424 -0.1021614  0.05091131 -0.1161374
## 2 -0.04945405  0.1034862 -0.02285610 -0.1125274
## 3 -0.10807500 -0.2265458  0.11929185  0.1448955
## 4  0.16580166  0.3335867 -0.18152283 -0.1668740
6 Clusters
kmeans_6$size
## [1] 3666   21 2297 1007  500  503
kmeans_6$centers
##    koi_score  koi_period koi_time0bk  koi_impact koi_duration   koi_depth
## 1  0.9986923 -0.16580372 -0.09371650 -0.22502758  -0.22224158 -0.29958714
## 2 -0.7192039  0.04883552 -0.04619534  1.29991088  -0.20531481 -0.23360722
## 3 -0.9788066 -0.27862083 -0.19454350  0.28926625  -0.09238015 -0.08715666
## 4 -0.6650187 -0.28858742 -0.30856810  0.15403408   0.05388176 -0.15599284
## 5 -0.9962382 -0.18905431 -0.23463221  0.01361026   0.35323595  3.10962783
## 6 -0.4572388  3.24440878  2.42434473 -0.05707565   1.59119190 -0.18755084
##       koi_prad     koi_teq   koi_insol koi_model_snr   koi_steff  koi_slogg
## 1 -0.079999595 -0.34427737 -0.04618898   -0.27797351 -0.23398591  0.2461515
## 2  6.140066746  7.21549938 10.76903988   -0.30938461 -2.52542298 -8.4944042
## 3 -0.002218252  0.07721412 -0.04088840   -0.10688999 -0.06777443  0.3742486
## 4  0.080403631  1.32496961  0.07656940   -0.09803218  0.74920107 -1.4313671
## 5  0.217884643  0.14676540 -0.03192648    2.92917050  0.46527677 -0.1342484
## 6 -0.040708430 -0.94312644 -0.04779827   -0.18845443  0.15789309 -0.1493976
##      koi_srad         ra         dec koi_kepmag
## 1 -0.10711396 -0.1893784  0.10840708  0.1118957
## 2 16.43718669  0.7367259 -0.19414232 -1.3842821
## 3 -0.12109929  0.1945341 -0.11077240  0.4547427
## 4  0.35760628  0.2353703 -0.16050033 -1.2627279
## 5 -0.04981071  0.1203382 -0.04405109 -0.1897004
## 6 -0.01896668 -0.1297055  0.08896663 -0.1178260

3.1.2. PAM

PAM is an implementation of the k-medoid method, a clustering technique that divides a dataset containing n objects into k clusters known in advance. The operation of this algorithm is similar to k-means, with the difference that the centroids are replaced by medoids, for which the distance from all other elements within a given group is minimal. This algorithm aims to minimize the sum of the distances of all non-medoid elements from the medoids closest to them.
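
In the same spirit (added here for reference), PAM chooses a set M of k medoids from among the data points so as to minimize the total dissimilarity of all points to their nearest medoid:

$$\min_{M\subset\{x_1,\dots,x_n\},\ \lvert M\rvert=k}\ \sum_{i=1}^{n}\min_{m\in M} d(x_i,m)$$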

3.1.2.1. Optimal Number of Clusters

In PAM the number of clusters has to be pre-determined, so some analysis has to be done beforehand. Below is a comparison of the average silhouette width and the total within-cluster sum of squares.

Silhouette
fviz_nbclust(n.kepler, FUNcluster = cluster::pam, method = "silhouette", k.max = 10, nboot = 100)

Sum of Squares
fviz_nbclust(n.kepler, FUNcluster = cluster::pam, method = "wss", k.max = 10, nboot = 100)

The silhouette width has the highest value for 4 clusters. The total within-cluster sum of squares, similarly as for k-means, suggests only that more clusters are better. For the further analysis 3, 4 and 7 clusters were chosen.

3.1.2.2. Results

Below, the clustering results are plotted for each chosen number of clusters k.

3 Clusters
pam_3 <- eclust(n.kepler, "pam", k=3, hc_metric="euclidean")

fviz_cluster(pam_3)

4 Clusters
pam_4 <- eclust(n.kepler, "pam", k=4, hc_metric="euclidean")

fviz_cluster(pam_4)

7 Clusters
pam_7 <- eclust(n.kepler, "pam", k=7, hc_metric="euclidean")

fviz_cluster(pam_7)

3.1.2.3. Average Silhouette Width

3 Clusters
fviz_silhouette(pam_3)
##   cluster size ave.sil.width
## 1       1 3926          0.34
## 2       2 3529         -0.03
## 3       3  539          0.11

4 Clusters
fviz_silhouette(pam_4)
##   cluster size ave.sil.width
## 1       1 3909          0.29
## 2       2 2869          0.05
## 3       3  692          0.07
## 4       4  524          0.11

7 Clusters
fviz_silhouette(pam_7)
##   cluster size ave.sil.width
## 1       1 1269          0.16
## 2       2 2003          0.10
## 3       3  943         -0.07
## 4       4 1117          0.12
## 5       5  647          0.04
## 6       6  501          0.10
## 7       7 1514          0.15

As predicted beforehand, the highest silhouette width is achieved for 4 clusters. The values for all numbers of clusters are relatively low and the goodness of fit is worse than for k-means.

3.1.2.4. Quality Measures

3 Clusters

Calinski-Harabasz test:

round(calinhara(n.kepler, pam_3$cluster),digits=2)
## [1] 841.83

Duda-Hart test:

dudahart2(n.kepler, pam_3$cluster)
## $p.value
## [1] 0
## 
## $dh
## [1] 0.7337832
## 
## $compare
## [1] 0.948305
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232
4 Clusters

Calinski-Harabasz test:

round(calinhara(n.kepler, pam_4$cluster),digits=2)
## [1] 839.54

Duda-Hart test:

dudahart2(n.kepler, pam_4$cluster)
## $p.value
## [1] 0
## 
## $dh
## [1] 0.5082218
## 
## $compare
## [1] 0.948305
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232
7 Clusters

Calinski-Harabasz test:

round(calinhara(n.kepler, pam_7$cluster),digits=2)
## [1] 654.08

Duda-Hart test:

dudahart2(n.kepler, pam_7$cluster)
## $p.value
## [1] 0
## 
## $dh
## [1] 0.1327225
## 
## $compare
## [1] 0.948305
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232

The p-value of the Duda-Hart test is again rounded to 0; however, in this case the Calinski-Harabasz value decreases as the number of clusters increases.

3.2. Mixed Variables

3.2.1. Gower Distance

Gower distance is a measure that can be applied to all types of variables: numerical, categorical, logical or character. The value always lies between 0 and 1. For ordinal variables it is based on a ranked Manhattan distance, for nominal variables the categories are converted to binary indicators and the Dice coefficient is applied, and for quantitative variables a range-normalized Manhattan distance is used. A sketch of how the distance matrix can be built is shown below.
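
The gower_dist object used in the next subsection is not constructed anywhere visible in this report; a plausible construction (an assumption, not the author’s verified code) uses cluster::daisy on the mixed dataset:

library(cluster)
gower_dist <- daisy(n.kepler_mix, metric = "gower")   # Gower dissimilarity matrix (assumed construction)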

3.2.1.1. Optimal Number of Clusters

gower_2 <- pam(gower_dist, diss = TRUE, k = 2)
gower_3 <- pam(gower_dist, diss = TRUE, k = 3)
gower_4 <- pam(gower_dist, diss = TRUE, k = 4)
gower_5 <- pam(gower_dist, diss = TRUE, k = 5)
gower_6 <- pam(gower_dist, diss = TRUE, k = 6)
gower_7 <- pam(gower_dist, diss = TRUE, k = 7)
gower_8 <- pam(gower_dist, diss = TRUE, k = 8)
gower_9 <- pam(gower_dist, diss = TRUE, k = 9)
gower_10 <- pam(gower_dist, diss = TRUE, k = 10)
print(gower_2$silinfo$avg.width)
## [1] 0.5128724
print(gower_3$silinfo$avg.width)
## [1] 0.5245092
print(gower_4$silinfo$avg.width)
## [1] 0.5773557
print(gower_5$silinfo$avg.width)
## [1] 0.393436
print(gower_6$silinfo$avg.width)
## [1] 0.378501
print(gower_7$silinfo$avg.width)
## [1] 0.3837245
print(gower_8$silinfo$avg.width)
## [1] 0.3766602
print(gower_9$silinfo$avg.width)
## [1] 0.3296458
print(gower_10$silinfo$avg.width)
## [1] 0.2656934

Silhouette width achieves the highest value for 4 clusters; therefore this number was chosen.

3.2.1.2. Results

gower_4 <- pam(gower_dist, diss = TRUE, k=4)

fviz_silhouette(gower_4)
##   cluster size ave.sil.width
## 1       1 4029          0.71
## 2       2 1762          0.51
## 3       3 1413          0.36
## 4       4  790          0.46

The obtained silhouette plot is the best among all the clustering methods. Very few values fall below zero and the average width is approximately 0.58.

3.2.1.3. Postdiagnostics

2 Clusters

Calinski-Harabasz test:

round(calinhara(n.kepler_mix, gower_2$cluster),digits=2)
## [1] 1318.68

Duda-Hart test:

dudahart2(n.kepler_mix, gower_2$cluster)
## $p.value
## [1] 0
## 
## $dh
## [1] 0.8583695
## 
## $compare
## [1] 0.9623101
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232
3 Clusters

Calinski-Harabasz test:

round(calinhara(n.kepler_mix, gower_3$cluster),digits=2)
## [1] 1002.34

Duda-Hart test:

dudahart2(n.kepler_mix, gower_3$cluster)
## $p.value
## [1] 0
## 
## $dh
## [1] 0.6536341
## 
## $compare
## [1] 0.9623101
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232
4 Clusters

Calinski-Harabasz test:

round(calinhara(n.kepler_mix, gower_4$cluster),digits=2)
## [1] 927.77

Duda-Hart test:

dudahart2(n.kepler_mix, gower_4$cluster)
## $p.value
## [1] 0
## 
## $dh
## [1] 0.4927167
## 
## $compare
## [1] 0.9623101
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232

The Calinski-Harabasz value decreases as the chosen number of clusters grows, opposite to the behaviour of the silhouette width.

4. Prediagnostics II

Before conducting Principal Component Analysis the data need to be tested. Two measures were chosen: Bartlett's test and the Kaiser-Meyer-Olkin (KMO) measure. Bartlett's test of sphericity compares the observed correlation matrix to the identity matrix. KMO measures sampling adequacy for each variable in the model and for the complete model.

For continuous data:

cor_matrix <- cor(n.kepler)
cortest.bartlett(cor_matrix, n = 7994)
## $chisq
## [1] 34167.57
## 
## $p.value
## [1] 0
## 
## $df
## [1] 120
KMO(cor_matrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_matrix)
## Overall MSA =  0.58
## MSA for each item = 
##     koi_score    koi_period   koi_time0bk    koi_impact  koi_duration 
##          0.53          0.53          0.64          0.48          0.59 
##     koi_depth      koi_prad       koi_teq     koi_insol koi_model_snr 
##          0.57          0.52          0.60          0.58          0.55 
##     koi_steff     koi_slogg      koi_srad            ra           dec 
##          0.57          0.61          0.66          0.67          0.69 
##    koi_kepmag 
##          0.63

For mixed data:

cor_matrix_mix <- cor(n.kepler_mix)
cortest.bartlett(cor_matrix_mix, n = 7994)
## $chisq
## [1] 92650.52
## 
## $p.value
## [1] 0
## 
## $df
## [1] 253
KMO(cor_matrix_mix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_matrix_mix)
## Overall MSA =  0.71
## MSA for each item = 
##  koi_disposition koi_pdisposition        koi_score    koi_fpflag_nt 
##             0.87             0.73             0.78             0.52 
##    koi_fpflag_ss    koi_fpflag_co    koi_fpflag_ec       koi_period 
##             0.75             0.77             0.87             0.59 
##      koi_time0bk       koi_impact     koi_duration        koi_depth 
##             0.67             0.59             0.65             0.74 
##         koi_prad          koi_teq        koi_insol    koi_model_snr 
##             0.54             0.68             0.59             0.75 
## koi_tce_plnt_num        koi_steff        koi_slogg         koi_srad 
##             0.88             0.63             0.62             0.66 
##               ra              dec       koi_kepmag 
##             0.84             0.81             0.64

Checking for visible correlations between variables:

kepler.cor <- cor(kepler, method = "pearson")
corrplot(kepler.cor, order ="alphabet", tl.cex=0.6)

The plot indicates several positive and a few negative correlations. Based on the test results and the correlation plot, it can be concluded that the data are eligible for further dimension reduction.

5. Dimension Reduction

5.1. Continuous Variables

5.1.1. PCA

Principal Component Analysis is one of the statistical methods of factor analysis. A dataset consisting of N observations, each containing K variables, can be interpreted as a cloud of N points in a K-dimensional space. The goal of PCA is to rotate the coordinate system so as to maximize first the variance of the first coordinate, then the variance of the second coordinate, and so on. The coordinate values of the observations in the rotated system are the scores on the generated factors (principal components). In this way a new observation space is constructed in which the first few factors explain most of the variability.
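
Formally (added here for reference), the first principal direction w_1 maximizes the variance of the projected data, and each subsequent direction does the same subject to being orthogonal to the previous ones:

$$w_1=\arg\max_{\lVert w\rVert=1}\operatorname{Var}(Xw),\qquad w_k=\arg\max_{\lVert w\rVert=1,\; w\perp w_1,\dots,w_{k-1}}\operatorname{Var}(Xw)$$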

5.1.1.1. Analysis and Eigenvalues

pca <- prcomp(n.kepler, center=FALSE, scale.=FALSE)
get_eigenvalue(pca)
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   2.8711543        17.944714                    17.94471
## Dim.2   2.0386946        12.741841                    30.68656
## Dim.3   1.8693421        11.683388                    42.36994
## Dim.4   1.4814177         9.258860                    51.62880
## Dim.5   1.2606210         7.878881                    59.50769
## Dim.6   1.0875519         6.797200                    66.30489
## Dim.7   0.9852795         6.157997                    72.46288
## Dim.8   0.8133554         5.083471                    77.54635
## Dim.9   0.7687834         4.804896                    82.35125
## Dim.10  0.6656214         4.160134                    86.51138
## Dim.11  0.5456875         3.410547                    89.92193
## Dim.12  0.4173867         2.608667                    92.53060
## Dim.13  0.3796646         2.372904                    94.90350
## Dim.14  0.3647637         2.279773                    97.18327
## Dim.15  0.2755864         1.722415                    98.90569
## Dim.16  0.1750898         1.094311                   100.00000

5.1.1.2. Visualisation

fviz_eig(pca, addlabels=TRUE)

One criterion for choosing the number of dimensions to keep is the eigenvalue (components with eigenvalues above 1 are retained). By this rule the values above indicate that 6 dimensions should be kept. Another criterion is the percentage of variance explained, which should be at least 5% per component; on the plot 8 dimensions meet that threshold, however the 8th dimension has an eigenvalue of only about 0.81. For the further analysis 7 dimensions, explaining 72.5% of the variance, were selected.

fviz_pca_var(pca, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "purple", "red"), repel = TRUE)

5.1.1.3. Contribution of Variables

Dim 1
fviz_contrib(pca, choice = "var", axes = 1, top = 10)

Dim 2
fviz_contrib(pca, choice = "var", axes = 2, top = 10)

Dim 3
fviz_contrib(pca, choice = "var", axes = 3, top = 10)

Dim 4
fviz_contrib(pca, choice = "var", axes = 4, top = 10)

Dim 5
fviz_contrib(pca, choice = "var", axes = 5, top = 10)

Dim 6
fviz_contrib(pca, choice = "var", axes = 6, top = 10)

Dim 7
fviz_contrib(pca, choice = "var", axes = 7, top = 10)

5.1.2. MDS

Multidimensional scaling is a statistical technique aimed at detecting hidden variables that, although not directly observed, explain the similarities and differences between the studied objects.

There are many competing scaling techniques. The input to the procedure is usually a matrix of distances or similarities between the objects; it may be, for example, a correlation matrix. Multidimensional scaling arranges the objects as points in an n-dimensional space so that similar objects lie closer together. The result of the analysis is, for each object, n real numbers that can be understood as Cartesian coordinates.
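
The classical (metric) variant used below through cmdscale works directly on the distance matrix: the squared distances are double-centred and the leading eigenvectors of the resulting matrix give the coordinates (stated here for reference):

$$B=-\tfrac{1}{2}\,J\,D^{(2)}\,J,\qquad J=I-\tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top},\qquad Y=V_k\Lambda_k^{1/2},$$

where D^{(2)} is the matrix of squared pairwise distances and V_k, Λ_k hold the top k eigenvectors and eigenvalues of B.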

5.1.2.1. Analysis

dist.kepler <- dist(n.kepler)
mds <- cmdscale(dist.kepler, k=2)
summary(mds)
##        V1                V2          
##  Min.   :-3.2466   Min.   :-14.8256  
##  1st Qu.:-0.9089   1st Qu.: -0.4039  
##  Median :-0.2801   Median :  0.4027  
##  Mean   : 0.0000   Mean   :  0.0000  
##  3rd Qu.: 0.4809   3rd Qu.:  0.8566  
##  Max.   :41.6389   Max.   : 18.1694

5.1.2.2. Visualisation

plot(mds)

The plot presents all observations reduced to a 2-D representation. It can clearly be observed that outliers are present, which may affect the results. If the analytic tools used are sensitive to outliers, the further analysis should be performed without those points to increase credibility.

5.2. Mixed Variables

5.2.1. PCAmix

Principal Component Analysis Of Mixed Data performs principal component analysis of a set of individuals (observations) described by a mixture of qualitative and quantitative variables. PCAmix includes ordinary principal component analysis (PCA) and multiple correspondence analysis (MCA) as special cases.

5.2.1.1. Data Preparation and Analysis

quali <- kepler_mix[, c(1, 2)]
quanti <- kepler

pca_mix <- PCAmix(X.quanti = quanti, X.quali = quali, ndim = 18, rename.level=TRUE, graph = FALSE)
pca_mix$eig
##         Eigenvalue  Proportion Cumulative
## dim 1  3.799400927 19.99684698   19.99685
## dim 2  2.481059712 13.05820901   33.05506
## dim 3  1.915053214 10.07922744   43.13428
## dim 4  1.484984540  7.81570811   50.94999
## dim 5  1.395797798  7.34630420   58.29630
## dim 6  1.260135492  6.63229206   64.92859
## dim 7  1.052730922  5.54068906   70.46928
## dim 8  0.984608610  5.18215058   75.65143
## dim 9  0.911799096  4.79894261   80.45037
## dim 10 0.791927709  4.16804057   84.61841
## dim 11 0.695691047  3.66153183   88.27994
## dim 12 0.550161241  2.89558548   91.17553
## dim 13 0.426418371  2.24430722   93.41984
## dim 14 0.378677759  1.99304084   95.41288
## dim 15 0.363847457  1.91498662   97.32786
## dim 16 0.281766117  1.48297956   98.81084
## dim 17 0.180412837  0.94954125   99.76038
## dim 18 0.035931957  0.18911556   99.94950
## dim 19 0.009595193  0.05050102  100.00000

In order to perform PCA for mixed data, the dataset had to be divided into two subsets: one containing the qualitative variables and one containing only the quantitative ones. The obtained eigenvalues indicate that the number of dimensions should be reduced to 7.

5.2.1.2. Visualisation of the correlations

plot(pca_mix, choice="sqload",main="Squared correlations")

plot(pca_mix, choice="cor",main="Correlation circle")

5.2.2. FAMD

Factor analysis of mixed data (FAMD) is another principal component method dedicated to analyzing a dataset containing both quantitative and qualitative variables. It makes it possible to analyze the similarity between individuals while taking mixed types of variables into account. It is a mixture of PCA and MCA: it uses PCA for the quantitative variables and MCA for the qualitative ones.

5.2.2.1. Analysis and Eigenvalues

famd <- FAMD(kepler_mix, ncp=23, graph=FALSE)
get_eigenvalue(famd)
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1  4.69275003       19.5531251                    19.55313
## Dim.2  2.54753602       10.6147334                    30.16786
## Dim.3  2.20675371        9.1948071                    39.36267
## Dim.4  2.06657965        8.6107485                    47.97341
## Dim.5  1.48899861        6.2041609                    54.17758
## Dim.6  1.31512006        5.4796669                    59.65724
## Dim.7  1.13605632        4.7335680                    64.39081
## Dim.8  0.98960099        4.1233375                    68.51415
## Dim.9  0.95856903        3.9940376                    72.50819
## Dim.10 0.89498788        3.7291162                    76.23730
## Dim.11 0.84396658        3.5165274                    79.75383
## Dim.12 0.78099437        3.2541432                    83.00797
## Dim.13 0.69039399        2.8766416                    85.88461
## Dim.14 0.63917734        2.6632389                    88.54785
## Dim.15 0.53479002        2.2282917                    90.77614
## Dim.16 0.43074457        1.7947690                    92.57091
## Dim.17 0.40137921        1.6724134                    94.24333
## Dim.18 0.37493036        1.5622098                    95.80554
## Dim.19 0.34684451        1.4451854                    97.25072
## Dim.20 0.27389069        1.1412112                    98.39193
## Dim.21 0.17700331        0.7375138                    99.12945
## Dim.22 0.16443456        0.6851440                    99.81459
## Dim.23 0.03515212        0.1464672                    99.96106

5.2.2.2. Visualisation

fviz_eig(famd, ncp=10, addlabels=TRUE)

fviz_famd_var(famd, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "purple", "red"), repel = TRUE)

5.2.2.3. Contribution of Variables

According to the eigenvalues the number of dimensions should be reduced to 7; however, the 7th dimension does not explain more than 5% of the variance. For the further analysis 6 dimensions were chosen, which explain almost 60% of the variance, an acceptable amount.

Dim 1
fviz_contrib(famd, choice = "var", axes = 1, top = 10)

Dim 2
fviz_contrib(famd, choice = "var", axes = 2, top = 10)

Dim 3
fviz_contrib(famd, choice = "var", axes = 3, top = 10)

Dim 4
fviz_contrib(famd, choice = "var", axes = 4, top = 10)

Dim 5
fviz_contrib(famd, choice = "var", axes = 5, top = 10)

Dim 6
fviz_contrib(famd, choice = "var", axes = 6, top = 10)

6. Conclusions

To conclude, this paper aimed to present several methods for clustering and dimension reduction of a mixed-type dataset. K-means and PAM were chosen for the continuous variables; however, they did not deliver acceptable results. The best cluster quality was achieved by PAM clustering using the Gower distance on the mixed dataset. For reducing dimensions, PCA and MDS were chosen for the continuous variables, and FAMD and PCAmix for the mixed variables. The overall conclusion is that, in order to achieve the best significance, robustness and quality of analysis, several methods need to be compared and tested.