Course Details

Specialisation: Data Analysis and Interpretation
Course: Machine Learning for Data Analysis
Education Institution: Wesleyan University
Publisher: Coursera
Assignment: Running a k-means Cluster Analysis

Introduction

Besides the historical fascination with Mars, due to its proximity and some similarities with Earth, the facts collected so far allow scientists to put together a big jigsaw puzzle.

NASA's recent announcement of evidence of liquid water flowing on the surface of Mars adds to what is already known, and to the curiosity, or even the need, to find out much more.

Summary

A k-means cluster analysis of the physical and geographical characteristics of the craters (quadrangle, diameter, depth and number of layers) suggests three groups: small, shallow craters with few layers; large, deep craters with few layers; and craters of average diameter that are not so deep and have more layers.

The Data Set

With all the talk about Mars in both science and fiction (Ridley Scott’s film The Martian, based on Andy Weir’s book of the same name, was released in early October 2015), the data set chosen for this assignment is the Mars Craters data set.

The Mars Craters Study presents a global database of 378,540 Mars craters with a diameter of 1 km or larger, created between 4.2 and 3.8 billion years ago during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).

The data set was made available by Wesleyan University/Coursera as part of the Data Analysis and Interpretation Specialisation. It comes from the Ph.D. thesis “Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database” (2011) by S.J. Robbins, University of Colorado at Boulder.

Topic of Interest

The data set provides a catalogue of craters on Mars. The initial idea is to check for patterns that could identify specific major events that might have happened and that would have had a significant impact on Mars’ geology, climate and life as a planetary body.

Codebook

As the initial data set has only nine variables, they could all be relevant to formulating hypotheses and reaching a conclusion, so all the variables are kept for this assignment.

Variables

  • CRATER_ID: crater ID for internal use, based upon the region of the planet (\({1 \over 16}\)), the “pass” under which the crater was identified, and the order in which it was identified
  • LATITUDE_CIRCLE_IMAGE: latitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees North)
  • LONGITUDE_CIRCLE_IMAGE: longitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees East)
  • DIAM_CIRCLE_IMAGE: diameter from a non-linear least squares circle fit to the vertices selected to manually identify the crater rim (units are km)
  • DEPTH_RIMFLOOR_TOPOG: average elevation of each of the manually determined N points along (or inside) the crater rim (units are km)
    • Depth Rim: Points are selected as relative topographic highs, under the assumption that they are the least eroded and therefore the most original points along the rim
    • Depth Floor: Points were chosen as the lowest elevation that did not include visible embedded craters
  • MORPHOLOGY_EJECTA_1: ejecta morphology classified.
    • If there are multiple values, separated by a “/”, then the order is the inner-most ejecta through the outer-most, or the top-most through the bottom-most
  • MORPHOLOGY_EJECTA_2: the morphology of the layer(s) itself/themselves. This classification system is unique to this work.
  • MORPHOLOGY_EJECTA_3: overall texture and/or shape of some of the layer(s)/ejecta that are generally unique and deserve separate morphological classification.
  • NUMBER_LAYERS: the maximum number of cohesive layers in any azimuthal direction that could be reliably identified

Extra Variables

  • MorphoE1_RD is a binary variable with value 1 if the primary morphology is classified as “Rd” (Radial), and 0 otherwise.

  • Quadrangle is a variable derived from the LATITUDE_CIRCLE_IMAGE and LONGITUDE_CIRCLE_IMAGE variables (see the Wikipedia definition below).

Each quadrangle holds roughly one to five percent of the recorded craters; MC-16 (Memnonia) has the most observations (20,455, or 5.32%) and MC-10 (Lunae Palus) has the fewest (3,478, or 0.90%).
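A distribution like the one above can be tabulated with PROC FREQ. This is only a minimal sketch: it assumes that the data set WORK, with the derived Quadrangle variable, has already been created by the DATA step listed under SAS Code. That step keeps only craters with a non-blank MORPHOLOGY_EJECTA_1, so its counts would differ from the figures quoted above, which appear to refer to the full catalogue.

```sas
/* Sketch: number and percentage of craters per quadrangle.         */
/* Assumes the data set WORK (with the derived Quadrangle variable) */
/* from the DATA step in the SAS Code section already exists.       */
PROC FREQ DATA = WORK ORDER = FREQ;  /* most populated quadrangle first */
  TABLES Quadrangle / NOCUM;         /* frequency and percent only      */
RUN;
```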

List of quadrangles on Mars (Wikipedia):

The surface of Mars has been divided into 30 quadrangles by the United States Geological Survey, so named because their borders lie along lines of latitude and longitude and so maps appear rectangular. Martian quadrangles are named after local features and are numbered with the prefix “MC” for “Mars Chart”. West longitude is used.

The following imagemap of the planet Mars is divided into 30 linked quadrangles. North is at the top; 0°N 180°W is at the far left on the equator. The map images were taken by the Mars Global Surveyor.

Running a k-means Cluster Analysis

Can the morphology of craters be grouped by their physical and geographical characteristics?

The variables Quadrangle, DIAM_CIRCLE_IMAGE, DEPTH_RIMFLOOR_TOPOG and NUMBER_LAYERS were used in a k-means analysis to identify groups in the catalogue of craters on Mars.

All clustering variables were standardised to have a mean of 0 and a standard deviation of 1.
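In symbols, this standardisation is the usual z-score. The notation is introduced here only for illustration: \(x_{ij}\) is the value of clustering variable \(j\) for crater \(i\), and \(\bar{x}_j\) and \(s_j\) are the mean and standard deviation of that variable over the \(n\) observations being standardised.

```latex
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},
\qquad
\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij},
\qquad
s_j = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( x_{ij} - \bar{x}_j \right)^2}
```

After this transformation every clustering variable contributes on the same scale to the distances used by k-means, so the diameter (in km) does not dominate the number of layers (a small count).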

As the number of clusters is not known beforehand, solutions with one to nine clusters were tested. The R-Square value of each solution was then plotted to help identify an optimal number of clusters (see Figure 1).

The chart points to three or six clusters as possible solutions.
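The search over the number of clusters can also be written as a macro loop. The sketch below is an assumption about how it could be organised, not the code used for the figures: the macro name elbow and its parameter maxk are hypothetical, and it assumes a standardised data set named CLUST_VAR with the four clustering variables.

```sas
/* Sketch: fit k-means for K = 1..maxk and collect the overall R-square. */
/* The macro name "elbow" and the parameter "maxk" are illustrative.     */
%macro elbow(maxk);
  %local k;
  %do k = 1 %to &maxk.;
    PROC FASTCLUS DATA        = CLUST_VAR
                  OUT         = OUTDATA&k.
                  OUTSTAT     = OUTSTAT&k.
                  MAXCLUSTERS = &k.
                  MAXITER     = 300;
      VAR Quadrangle DIAM_CIRCLE_IMAGE DEPTH_RIMFLOOR_TOPOG NUMBER_LAYERS;
    RUN;

    /* Keep only the row holding the R-square values of this solution */
    DATA CLUST&k.;
      SET OUTSTAT&k.;
      NCLUST = &k.;
      IF _TYPE_ = "RSQ";
      KEEP NCLUST OVER_ALL;
    RUN;
  %end;

  /* Stack the per-K rows into one data set for the elbow plot */
  DATA CLUST_RSQUARE;
    SET CLUST1 - CLUST&maxk.;
  RUN;
%mend elbow;

%elbow(9);
```

The resulting CLUST_RSQUARE data set has one row per candidate number of clusters, which is the shape needed to plot OVER_ALL against NCLUST.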

A canonical discriminant analysis was used to reduce the clustering variables to a few canonical variables that account for most of their variance. A scatterplot of the first two canonical variables by cluster (see Figure 2) shows that observations are densely packed in cluster 1, and also in cluster 3 although with somewhat more variance. Cluster 2 has some density but also a large variance. A reasonably well-defined border can be identified between the three clusters.

Cluster 1 aggregates craters with smaller diameter, depth and number of layers. Cluster 2 has the larger and deeper craters, but with fewer layers. Cluster 3 has craters of average diameter that are not so deep and have more layers (see Table 1).
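Cluster means such as these are part of the printed PROC FASTCLUS output. As a hedged alternative, they could also be recomputed from the output data set of the three-cluster solution. The sketch assumes that data set is named OUTDATA3 and contains the CLUSTER variable written by FASTCLUS along with the standardised clustering variables.

```sas
/* Sketch: mean of each standardised clustering variable per cluster. */
/* Assumes OUTDATA3 is the OUT= data set of the 3-cluster FASTCLUS    */
/* run, which includes the CLUSTER assignment for every observation.  */
PROC MEANS DATA = OUTDATA3 MEAN MAXDEC = 3;
  CLASS CLUSTER;
  VAR Quadrangle DIAM_CIRCLE_IMAGE DEPTH_RIMFLOOR_TOPOG NUMBER_LAYERS;
RUN;
```

Because the variables are standardised, a mean near 0 is an average value and a mean of 3 lies about three standard deviations above the average.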

Using SAS

SAS Output

Figure 1 - Elbow Curve for 1 to 9 clusters

Figure 1 - Elbow curve of R-Square values for the nine cluster solutions

Figure 2 - Canonical plot for Three Clusters

Figure 2 - Plot of the first two canonical variables for the clustering variables by cluster

Table 1 - Cluster Means

Cluster    Quadrangle  DIAM_CIRCLE_IMAGE  DEPTH_RIMFLOOR_TOPOG  NUMBER_LAYERS
      1   0.112239498       -0.280344716          -0.429614158   -0.480625152
      2   0.437416238        3.118525142           2.274367342   -0.531405570
      3  -0.315140323        0.007455715           0.475155741    1.101063849

SAS Code

/* Use Course's Library */
LIBNAME mydata "/courses/d1406ae5ba27fe300" ACCESS = readonly;

DATA WORK;
  SET mydata.marscrater_pds;

  /* Keep only craters with a recorded ejecta morphology */
  WHERE MORPHOLOGY_EJECTA_1 NE " ";

  /* Collapse the Morphology of Ejecta 1 to its main feature, to reduce the output */
  IF (INDEX(MORPHOLOGY_EJECTA_1, "/") = 0)
    THEN MorphoE1 = MORPHOLOGY_EJECTA_1;
    ELSE MorphoE1 = SUBSTR(MORPHOLOGY_EJECTA_1, 1, INDEX(MORPHOLOGY_EJECTA_1, "/") - 1);
  MorphoE1 = UPCASE(TRIM(MorphoE1));

  /* Flag craters whose main Morphology 1 is "RD" (Radial) */
  IF MorphoE1 = "RD"
    THEN MorphoE1_RD = 1;
    ELSE MorphoE1_RD = 0;

  /* convert coordinates to Quadrangles: https://en.wikipedia.org/wiki/List_of_quadrangles_on_Mars */
  LA = LATITUDE_CIRCLE_IMAGE;
  LO = LONGITUDE_CIRCLE_IMAGE + 180;
  IF LA >=  65 AND LA <=  90 AND LO >=   0 AND LO <= 360 THEN Quadrangle = 01;
  IF LA >=  30 AND LA <   65 AND LO >= 120 AND LO <  180 THEN Quadrangle = 02;
  IF LA >=  30 AND LA <   65 AND LO >=  60 AND LO <  120 THEN Quadrangle = 03;
  IF LA >=  30 AND LA <   65 AND LO >=   0 AND LO <   60 THEN Quadrangle = 04;
  IF LA >=  30 AND LA <   65 AND LO >= 300 AND LO <= 360 THEN Quadrangle = 05;
  IF LA >=  30 AND LA <   65 AND LO >= 240 AND LO <  300 THEN Quadrangle = 06;
  IF LA >=  30 AND LA <   65 AND LO >= 180 AND LO <  240 THEN Quadrangle = 07;
  IF LA >=   0 AND LA <   30 AND LO >= 135 AND LO <  180 THEN Quadrangle = 08;
  IF LA >=   0 AND LA <   30 AND LO >=  90 AND LO <  135 THEN Quadrangle = 09;
  IF LA >=   0 AND LA <   30 AND LO >=  45 AND LO <   90 THEN Quadrangle = 10;
  IF LA >=   0 AND LA <   30 AND LO >=   0 AND LO <   45 THEN Quadrangle = 11;
  IF LA >=   0 AND LA <   30 AND LO >= 315 AND LO <= 360 THEN Quadrangle = 12;
  IF LA >=   0 AND LA <   30 AND LO >= 270 AND LO <  315 THEN Quadrangle = 13;
  IF LA >=   0 AND LA <   30 AND LO >= 225 AND LO <  270 THEN Quadrangle = 14;
  IF LA >=   0 AND LA <   30 AND LO >= 180 AND LO <  225 THEN Quadrangle = 15;
  IF LA >= -30 AND LA <    0 AND LO >= 135 AND LO <  180 THEN Quadrangle = 16;
  IF LA >= -30 AND LA <    0 AND LO >=  90 AND LO <  135 THEN Quadrangle = 17;
  IF LA >= -30 AND LA <    0 AND LO >=  45 AND LO <   90 THEN Quadrangle = 18;
  IF LA >= -30 AND LA <    0 AND LO >=   0 AND LO <   45 THEN Quadrangle = 19;
  IF LA >= -30 AND LA <    0 AND LO >= 315 AND LO <= 360 THEN Quadrangle = 20;
  IF LA >= -30 AND LA <    0 AND LO >= 270 AND LO <  315 THEN Quadrangle = 21;
  IF LA >= -30 AND LA <    0 AND LO >= 225 AND LO <  270 THEN Quadrangle = 22;
  IF LA >= -30 AND LA <    0 AND LO >= 180 AND LO <  225 THEN Quadrangle = 23;
  IF LA >= -65 AND LA <  -30 AND LO >= 120 AND LO <  180 THEN Quadrangle = 24;
  IF LA >= -65 AND LA <  -30 AND LO >=  60 AND LO <  120 THEN Quadrangle = 25;
  IF LA >= -65 AND LA <  -30 AND LO >=   0 AND LO <   60 THEN Quadrangle = 26;
  IF LA >= -65 AND LA <  -30 AND LO >= 300 AND LO <= 360 THEN Quadrangle = 27;
  IF LA >= -65 AND LA <  -30 AND LO >= 240 AND LO <  300 THEN Quadrangle = 28;
  IF LA >= -65 AND LA <  -30 AND LO >= 180 AND LO <  240 THEN Quadrangle = 29;
  IF LA >= -90 AND LA <  -65 AND LO >=   0 AND LO <= 360 THEN Quadrangle = 30;

  LABEL Quadrangle             = "Quadrangle"
        DIAM_CIRCLE_IMAGE      = "Diameter"
        DEPTH_RIMFLOOR_TOPOG   = "Depth"
        MorphoE1_RD            = "Morphology 1-RD"
        NUMBER_LAYERS          = "Layers";

RUN;

ODS GRAPHICS ON;

/* Split data randomly into test and training data */
PROC SURVEYSELECT DATA     = WORK
                  OUT      = TRAINTEST
                  METHOD   = SRS
                  SAMPRATE = 0.7
                  SEED     = 6587
                  OUTALL;
RUN;

DATA CLUST_TRAIN;
  SET TRAINTEST;
  IF selected = 1;
RUN;

DATA CLUST_TEST;
  SET TRAINTEST;
  IF selected = 0;
RUN;

/* standardize the clustering variables to have a mean of 0 and standard deviation of 1 */
PROC STANDARD DATA = CLUST_TRAIN
              OUT  = CLUST_VAR
              MEAN = 0
              STD  = 1;
  VAR Quadrangle
      DIAM_CIRCLE_IMAGE
      DEPTH_RIMFLOOR_TOPOG
      MorphoE1_RD
      NUMBER_LAYERS;
RUN;

/* run a k-means cluster analysis with K clusters */
%macro kmean(K);
PROC FASTCLUS DATA        = CLUST_VAR
              OUT         = OUTDATA&K.
              OUTSTAT     = OUTSTAT&K.
              MAXCLUSTERS = &K.
              MAXITER     = 300;
  VAR Quadrangle
      DIAM_CIRCLE_IMAGE
      DEPTH_RIMFLOOR_TOPOG
      NUMBER_LAYERS;
RUN;
%mend;

%kmean(1);
%kmean(2);
%kmean(3);
%kmean(4);
%kmean(5);
%kmean(6);
%kmean(7);
%kmean(8);
%kmean(9);

/* extract r-square values from each cluster solution
   and then merge them to plot elbow curve */
DATA CLUST1;
  SET OUTSTAT1;
  NCLUST = 1;
  IF _type_ = "RSQ";
  KEEP NCLUST over_all;
RUN;

DATA CLUST2;
  SET OUTSTAT2;
  NCLUST = 2;
  IF _type_ = "RSQ";
  KEEP NCLUST over_all;
RUN;

DATA CLUST3;
  SET OUTSTAT3;
  NCLUST = 3;
  IF _type_ = "RSQ";
  KEEP NCLUST over_all;
RUN;

DATA CLUST4;
  SET OUTSTAT4;
  NCLUST = 4;
  IF _type_ = "RSQ";
  KEEP NCLUST over_all;
RUN;

DATA CLUST5;
  SET OUTSTAT5;
  NCLUST = 5;
  IF _type_ = "RSQ";
  KEEP NCLUST over_all;
RUN;

DATA CLUST6;
  SET OUTSTAT6;
  NCLUST = 6;
  IF _type_ = "RSQ";
  KEEP NCLUST over_all;
RUN;

DATA CLUST7;
  SET OUTSTAT7;
  NCLUST = 7;
  IF _type_ = "RSQ";
  KEEP NCLUST over_all;
RUN;

DATA CLUST8;
  SET OUTSTAT8;
  NCLUST = 8;
  IF _type_ = "RSQ";
  KEEP NCLUST over_all;
RUN;

DATA CLUST9;
  SET OUTSTAT9;
  NCLUST = 9;
  IF _type_ = "RSQ";
  KEEP NCLUST over_all;
RUN;

DATA CLUST_RSQUARE;
  SET CLUST1 CLUST2 CLUST3 CLUST4 CLUST5 CLUST6 CLUST7 CLUST8 CLUST9;
RUN;

/* plot elbow curve using r-square values */
SYMBOL1 COLOR = BLUE INTERPOL = JOIN;
PROC GPLOT DATA = CLUST_RSQUARE;
  PLOT over_all * NCLUST;
RUN;

/* further examine cluster solution for the number of clusters suggested by the elbow curve */

/* plot clusters for 3 cluster solution */
PROC CANDISC DATA = OUTDATA3
             OUT  = CLUST_CAN;
  CLASS CLUSTER;
  VAR Quadrangle
      DIAM_CIRCLE_IMAGE
      DEPTH_RIMFLOOR_TOPOG
      NUMBER_LAYERS;
RUN;

PROC SGPLOT DATA = CLUST_CAN;
  SCATTER Y = CAN2 X = CAN1 / GROUP = CLUSTER;
RUN;

/* validate clusters on MorphoE1_RD */
PROC SORT DATA = CLUST_CAN;
  BY CLUSTER;
RUN;

PROC MEANS DATA = CLUST_CAN;
  VAR MorphoE1_RD;
  BY  CLUSTER;
RUN;

PROC ANOVA DATA = CLUST_CAN;
  CLASS CLUSTER;
  MODEL MorphoE1_RD = CLUSTER;
  MEANS CLUSTER / TUKEY;
RUN;