Specialisation | Data Analysis and Interpretation |
Course | Machine Learning for Data Analysis |
Education Institution | Wesleyan University |
Publisher | Coursera |
Assignment | Running a k-means Cluster Analysis |
Beyond the historical fascination with Mars, due to its proximity and some similarities with Earth, the collection of facts about the planet allows scientists to piece together a big jigsaw puzzle.
A recent NASA announcement of evidence of flowing liquid water on the surface of Mars adds to what is already known and to the curiosity, or even the need, to find out much more.
When a classification tree algorithm is used to associate the physical and geographical characteristics of a crater with its morphology, a good number of craters can be identified for some particular, more common formations. The identification rate varies with the kind of formation, and some characteristics are shared by several formations.
With all the talk about Mars in both science and fiction (The Martian, a Ridley Scott movie based on the book of the same name by Andy Weir, was released in early October 2015), the data set chosen for this assignment is the Mars Craters data set.
The Mars Craters Study presents a global database that includes 378,540 Mars craters with a diameter of 1 km or larger that were created between 4.2 and 3.8 billion years ago during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).
The data set was made available by Wesleyan University/Coursera as part of the Data Analysis and Interpretation Specialisation, from the Ph.D. Thesis Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database (2011) by Robbins, S.J., University of Colorado at Boulder.
The data set provides a catalogue of craters on Mars. The initial thoughts are about checking for patterns that could identify specific major events that might have happened and that would have significant impact on Mars’ geology, climate and life as a planetary body.
As the initial data set has only nine variables, they could all be relevant to formulating hypotheses and reaching a conclusion, so all the variables will be kept for this assignment.
MorphoE1_RD is a categorical variable that takes the value “Yes” (coded 1 in the program) if the primary morphology is classified as “Rd” (Radial), and “No” (coded 0) otherwise.
Quadrangle is a variable derived from the LATITUDE_CIRCLE_IMAGE and LONGITUDE_CIRCLE_IMAGE variables (see the Wikipedia definition below).
Each quadrangle holds roughly one to five percent of the recorded craters. MC-16 (Memnonia) has the most observations (20,455, or 5.32%) and MC-10 (Lunae Palus) has the fewest (3,478, or 0.90%).
List of quadrangles on Mars (Wikipedia):
The surface of Mars has been divided into 30 quadrangles by the United States Geological Survey, so named because their borders lie along lines of latitude and longitude and so maps appear rectangular. Martian quadrangles are named after local features and are numbered with the prefix “MC” for “Mars Chart”. West longitude is used.
The following imagemap of the planet Mars is divided into 30 linked quadrangles. North is at the top; 0°N 180°W is at the far left on the equator. The map images were taken by the Mars Global Surveyor.
From Wikipedia, Source: http://photojournal.jpl.nasa.gov/catalog/PIA03467
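The derivation of Quadrangle from latitude and longitude can be sketched compactly. The Python function below is only an illustration, not the SAS code used for this assignment: it mirrors the latitude bands and longitude sectors of the SAS program listed at the end, and it assumes (as the `+ 180` shift in that program implies) that longitudes run from -180 to 180. The names `BANDS` and `quadrangle` are hypothetical.

```python
# (lat_min, lat_max, first_quadrangle_number, sector_width_in_degrees)
# for the four bands between the two polar caps.
BANDS = [
    (30, 65, 2, 60),
    (0, 30, 8, 45),
    (-30, 0, 16, 45),
    (-65, -30, 24, 60),
]


def quadrangle(lat, lon):
    """Return a quadrangle number from 1 to 30 for a crater centre."""
    if lat >= 65:
        return 1          # north polar cap
    if lat < -65:
        return 30         # south polar cap
    lo = lon + 180        # same shift as the SAS program: 0..360
    for lat_min, lat_max, first, width in BANDS:
        if lat_min <= lat < lat_max:
            count = 360 // width                      # sectors in this band
            sector = min(int(lo // width), count - 1)  # lo == 360 joins the last sector
            start = 180 // width - 1                  # sector just below lo = 180
            # Numbering starts at that sector and proceeds towards lower lo,
            # wrapping around at 0/360.
            return first + (start - sector) % count
    raise ValueError("latitude out of range")


print(quadrangle(-15, -30))   # latitude band -30..0, shifted longitude 150
```

A lookup table keeps the band limits in one place, whereas the chain of IF statements repeats them on every line; both approaches assign the same numbers for the cases checked here.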
Can the morphology of craters be grouped by their physical and geographical characteristics?
The variables Quadrangle, DIAM_CIRCLE_IMAGE, DEPTH_RIMFLOOR_TOPOG and NUMBER_LAYERS were used in a k-means analysis to identify groups in the catalogue of craters on Mars.
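For readers unfamiliar with the technique, the core of a k-means analysis can be sketched in a few lines of Python. This is a generic textbook iteration (assign each observation to the nearest centre, then recompute the centres), not the SAS procedure used in this assignment. The starting centres are passed in explicitly, which is an assumption made for illustration, and the sample points are hypothetical.

```python
def kmeans(points, centres, max_iter=300):
    """Alternate nearest-centre assignment and centre updates until stable."""
    centres = list(centres)   # work on a copy of the starting centres
    labels = []
    for _ in range(max_iter):
        # Assign every point to the centre with the smallest squared distance.
        new_labels = [
            min(range(len(centres)),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centres[c])))
            for p in points
        ]
        if new_labels == labels:
            break             # assignments no longer change
        labels = new_labels
        # Move each centre to the mean of the points assigned to it.
        for c in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:       # an empty cluster keeps its previous centre
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centres


# Hypothetical two-variable observations, for illustration only.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
labels, centres = kmeans(points, [(0.0, 0.0), (5.0, 5.0)])
print(labels)
```

The `max_iter=300` default echoes the iteration limit set in the SAS program; everything else about the sketch is generic.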
All clustering variables were standardised to have a mean of 0 and a standard deviation of 1.
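Standardising matters because k-means relies on distances, and a variable measured on a wide scale (such as Quadrangle, from 1 to 30) would otherwise outweigh one measured on a narrow scale (such as the number of layers). A minimal Python sketch of the transformation follows. It uses the sample standard deviation (n - 1 divisor), which is assumed here to match the SAS default, and the diameters are hypothetical values.

```python
from statistics import mean, stdev


def standardise(values):
    """Rescale values to a mean of 0 and a sample standard deviation of 1."""
    centre = mean(values)
    spread = stdev(values)    # sample standard deviation (n - 1 divisor)
    if spread == 0:
        raise ValueError("cannot standardise a constant column")
    return [(v - centre) / spread for v in values]


# Hypothetical crater diameters in km, for illustration only.
diameters_km = [1.2, 3.5, 8.0, 2.1, 40.3]
z_scores = standardise(diameters_km)
print([round(z, 3) for z in z_scores])
```

After the transformation, a value of 3.1 means "about three standard deviations above the average", which is how the cluster means in Table 1 are read.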
As the number of clusters is not known beforehand, solutions with one to nine clusters were tested. The R-Square value of each solution (the proportion of variance accounted for by the clusters) was then plotted to help identify an optimal number of clusters (see Figure 1).
The chart points to three and six as possible solutions to the number of clusters.
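The R-Square behind the elbow curve can be illustrated with a short Python sketch, assuming the usual definition of one minus the ratio of the within-cluster sum of squares to the total sum of squares, pooled over all clustering variables. The function name `r_square` and the sample points are hypothetical, and the values it produces are unrelated to the results of this assignment.

```python
def r_square(points, labels):
    """Share of the total variance explained by a cluster assignment."""
    dims = range(len(points[0]))
    grand = [sum(p[d] for p in points) / len(points) for d in dims]
    total_ss = sum((p[d] - grand[d]) ** 2 for p in points for d in dims)

    within_ss = 0.0
    for cluster in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        centre = [sum(p[d] for p in members) / len(members) for d in dims]
        within_ss += sum((p[d] - centre[d]) ** 2 for p in members for d in dims)
    return 1.0 - within_ss / total_ss


# Hypothetical standardised observations (two variables), for illustration only.
points = [(-1.0, -1.2), (-0.8, -1.0), (1.0, 0.9), (1.2, 1.1), (0.0, 3.0)]
one_cluster = r_square(points, [1, 1, 1, 1, 1])
two_clusters = r_square(points, [1, 1, 2, 2, 2])
three_clusters = r_square(points, [1, 1, 2, 2, 3])
print(one_cluster, two_clusters, three_clusters)
```

With a single cluster nothing is explained, and each further split of an existing cluster can only raise the value. The elbow is the point where the gain from adding another cluster becomes small.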
A canonical discriminant analysis was used to reduce the clustering variables to a few canonical variables that account for most of their variance. A scatterplot of the first two canonical variables by cluster (see Figure 2) shows that observations are densely packed in cluster 1, and likewise in cluster 3, though with somewhat more spread. Cluster 2 shows some density but also a large spread. A reasonably well-defined border can be identified between the three clusters.
Cluster 1 groups craters with smaller diameter, shallower depth and fewer layers. Cluster 2 has the largest and deepest craters, but with few layers. Cluster 3 has craters of average diameter that are moderately deep (above average, but well below cluster 2) and have more layers (see Table 1).
Figure 1 - Elbow curve of R-Square values for the nine cluster solutions
Figure 2 - Plot of the first two canonical variables for the clustering variables by cluster
Table 1 - Cluster means of the standardised clustering variables
Cluster | Quadrangle | DIAM_CIRCLE_IMAGE | DEPTH_RIMFLOOR_TOPOG | NUMBER_LAYERS |
---|---|---|---|---|
1 | 0.112239498 | -0.280344716 | -0.429614158 | -0.480625152 |
2 | 0.437416238 | 3.118525142 | 2.274367342 | -0.531405570 |
3 | -0.315140323 | 0.007455715 | 0.475155741 | 1.101063849 |
/* Use Course's Library */
LIBNAME mydata "/courses/d1406ae5ba27fe300" ACCESS = readonly;

DATA WORK;
    SET mydata.marscrater_pds;

    WHERE MORPHOLOGY_EJECTA_1 NE " ";

    /* Collapse the Morphology of Ejecta 1 to its main feature, to reduce the output */
    IF (INDEX(MORPHOLOGY_EJECTA_1, "/") = 0)
        THEN MorphoE1 = MORPHOLOGY_EJECTA_1;
        ELSE MorphoE1 = SUBSTR(MORPHOLOGY_EJECTA_1, 1, INDEX(MORPHOLOGY_EJECTA_1, "/") - 1);
    MorphoE1 = UPCASE(TRIM(MorphoE1));

    /* Is Morphology 1 equal to "RD"? */
    IF MorphoE1 = "RD"
        THEN MorphoE1_RD = 1;
        ELSE MorphoE1_RD = 0;

    /* convert coordinates to Quadrangles: https://en.wikipedia.org/wiki/List_of_quadrangles_on_Mars */
    LA = LATITUDE_CIRCLE_IMAGE;
    LO = LONGITUDE_CIRCLE_IMAGE + 180;
    IF LA >= 65 AND LA <= 90 AND LO >= 0 AND LO <= 360 THEN Quadrangle = 01;
    IF LA >= 30 AND LA < 65 AND LO >= 120 AND LO < 180 THEN Quadrangle = 02;
    IF LA >= 30 AND LA < 65 AND LO >= 60 AND LO < 120 THEN Quadrangle = 03;
    IF LA >= 30 AND LA < 65 AND LO >= 0 AND LO < 60 THEN Quadrangle = 04;
    IF LA >= 30 AND LA < 65 AND LO >= 300 AND LO <= 360 THEN Quadrangle = 05;
    IF LA >= 30 AND LA < 65 AND LO >= 240 AND LO < 300 THEN Quadrangle = 06;
    IF LA >= 30 AND LA < 65 AND LO >= 180 AND LO < 240 THEN Quadrangle = 07;
    IF LA >= 0 AND LA < 30 AND LO >= 135 AND LO < 180 THEN Quadrangle = 08;
    IF LA >= 0 AND LA < 30 AND LO >= 90 AND LO < 135 THEN Quadrangle = 09;
    IF LA >= 0 AND LA < 30 AND LO >= 45 AND LO < 90 THEN Quadrangle = 10;
    IF LA >= 0 AND LA < 30 AND LO >= 0 AND LO < 45 THEN Quadrangle = 11;
    IF LA >= 0 AND LA < 30 AND LO >= 315 AND LO <= 360 THEN Quadrangle = 12;
    IF LA >= 0 AND LA < 30 AND LO >= 270 AND LO < 315 THEN Quadrangle = 13;
    IF LA >= 0 AND LA < 30 AND LO >= 225 AND LO < 270 THEN Quadrangle = 14;
    IF LA >= 0 AND LA < 30 AND LO >= 180 AND LO < 225 THEN Quadrangle = 15;
    IF LA >= -30 AND LA < 0 AND LO >= 135 AND LO < 180 THEN Quadrangle = 16;
    IF LA >= -30 AND LA < 0 AND LO >= 90 AND LO < 135 THEN Quadrangle = 17;
    IF LA >= -30 AND LA < 0 AND LO >= 45 AND LO < 90 THEN Quadrangle = 18;
    IF LA >= -30 AND LA < 0 AND LO >= 0 AND LO < 45 THEN Quadrangle = 19;
    IF LA >= -30 AND LA < 0 AND LO >= 315 AND LO <= 360 THEN Quadrangle = 20;
    IF LA >= -30 AND LA < 0 AND LO >= 270 AND LO < 315 THEN Quadrangle = 21;
    IF LA >= -30 AND LA < 0 AND LO >= 225 AND LO < 270 THEN Quadrangle = 22;
    IF LA >= -30 AND LA < 0 AND LO >= 180 AND LO < 225 THEN Quadrangle = 23;
    IF LA >= -65 AND LA < -30 AND LO >= 120 AND LO < 180 THEN Quadrangle = 24;
    IF LA >= -65 AND LA < -30 AND LO >= 60 AND LO < 120 THEN Quadrangle = 25;
    IF LA >= -65 AND LA < -30 AND LO >= 0 AND LO < 60 THEN Quadrangle = 26;
    IF LA >= -65 AND LA < -30 AND LO >= 300 AND LO <= 360 THEN Quadrangle = 27;
    IF LA >= -65 AND LA < -30 AND LO >= 240 AND LO < 300 THEN Quadrangle = 28;
    IF LA >= -65 AND LA < -30 AND LO >= 180 AND LO < 240 THEN Quadrangle = 29;
    IF LA >= -90 AND LA < -65 AND LO >= 0 AND LO <= 360 THEN Quadrangle = 30;

    LABEL Quadrangle = "Quadrangle"
          DIAM_CIRCLE_IMAGE = "Diameter"
          DEPTH_RIMFLOOR_TOPOG = "Depth"
          MorphoE1_RD = "Morphology 1-RD"
          NUMBER_LAYERS = "Layers";

RUN;

ODS GRAPHICS ON;

/* Split data randomly into test and training data */
PROC SURVEYSELECT DATA = WORK
    OUT = TRAINTEST
    METHOD = SRS
    SAMPRATE = 0.7
    SEED = 6587
    OUTALL;
RUN;

DATA CLUST_TRAIN;
    SET TRAINTEST;
    IF selected = 1;
RUN;

DATA CLUST_TEST;
    SET TRAINTEST;
    IF selected = 0;
RUN;

/* standardize the clustering variables to have a mean of 0 and standard deviation of 1 */
PROC STANDARD DATA = CLUST_TRAIN
    OUT = CLUST_VAR
    MEAN = 0
    STD = 1;
    VAR Quadrangle
        DIAM_CIRCLE_IMAGE
        DEPTH_RIMFLOOR_TOPOG
        MorphoE1_RD
        NUMBER_LAYERS;
RUN;

/* run a k-means analysis with K clusters, saving the assignments and the statistics */
%macro kmean(K);
PROC FASTCLUS DATA = CLUST_VAR
    OUT = OUTDATA&K.
    OUTSTAT = OUTSTAT&K.
    MAXCLUSTERS = &K.
    MAXITER = 300;
    VAR Quadrangle
        DIAM_CIRCLE_IMAGE
        DEPTH_RIMFLOOR_TOPOG
        NUMBER_LAYERS;
RUN;
%mend;

%kmean(1);
%kmean(2);
%kmean(3);
%kmean(4);
%kmean(5);
%kmean(6);
%kmean(7);
%kmean(8);
%kmean(9);

/* extract r-square values from each cluster solution
   and then merge them to plot elbow curve */
DATA CLUST1;
    SET OUTSTAT1;
    NCLUST = 1;
    IF _type_ = "RSQ";
    KEEP NCLUST over_all;
RUN;

DATA CLUST2;
    SET OUTSTAT2;
    NCLUST = 2;
    IF _type_ = "RSQ";
    KEEP NCLUST over_all;
RUN;

DATA CLUST3;
    SET OUTSTAT3;
    NCLUST = 3;
    IF _type_ = "RSQ";
    KEEP NCLUST over_all;
RUN;

DATA CLUST4;
    SET OUTSTAT4;
    NCLUST = 4;
    IF _type_ = "RSQ";
    KEEP NCLUST over_all;
RUN;

DATA CLUST5;
    SET OUTSTAT5;
    NCLUST = 5;
    IF _type_ = "RSQ";
    KEEP NCLUST over_all;
RUN;

DATA CLUST6;
    SET OUTSTAT6;
    NCLUST = 6;
    IF _type_ = "RSQ";
    KEEP NCLUST over_all;
RUN;

DATA CLUST7;
    SET OUTSTAT7;
    NCLUST = 7;
    IF _type_ = "RSQ";
    KEEP NCLUST over_all;
RUN;

DATA CLUST8;
    SET OUTSTAT8;
    NCLUST = 8;
    IF _type_ = "RSQ";
    KEEP NCLUST over_all;
RUN;

DATA CLUST9;
    SET OUTSTAT9;
    NCLUST = 9;
    IF _type_ = "RSQ";
    KEEP NCLUST over_all;
RUN;

DATA CLUST_RSQUARE;
    SET CLUST1 CLUST2 CLUST3 CLUST4 CLUST5 CLUST6 CLUST7 CLUST8 CLUST9;
RUN;

/* plot elbow curve using r-square values */
SYMBOL1 COLOR = BLUE INTERPOL = JOIN;
PROC GPLOT DATA = CLUST_RSQUARE;
    PLOT over_all * NCLUST;
RUN;

/* further examine cluster solution for the number of clusters suggested by the elbow curve */

/* plot clusters for 3 cluster solution */
PROC CANDISC DATA = OUTDATA3
    OUT = CLUST_CAN;
    CLASS CLUSTER;
    VAR Quadrangle
        DIAM_CIRCLE_IMAGE
        DEPTH_RIMFLOOR_TOPOG
        NUMBER_LAYERS;
RUN;

PROC SGPLOT DATA = CLUST_CAN;
    SCATTER Y = CAN2 X = CAN1 / GROUP = CLUSTER;
RUN;

/* validate clusters on MorphoE1_RD */
PROC SORT DATA = CLUST_CAN;
    BY CLUSTER;
RUN;

PROC MEANS DATA = CLUST_CAN;
    VAR MorphoE1_RD;
    BY CLUSTER;
RUN;

PROC ANOVA DATA = CLUST_CAN;
    CLASS CLUSTER;
    MODEL MorphoE1_RD = CLUSTER;
    MEANS CLUSTER / TUKEY;
RUN;