Course Details

Specialisation Data Analysis and Interpretation
Course Machine Learning for Data Analysis
Education Institution Wesleyan University
Publisher Coursera
Assignment Running a Classification Tree

Introduction

Beyond the long-standing fascination with Mars, due to its proximity and its similarities with Earth, each new fact collected about the planet helps scientists put together a big jigsaw puzzle.

NASA’s recent announcement of evidence of liquid water flowing on the surface of Mars adds to what is already known, and to the curiosity, or even the need, to find out much more.

Summary

A classification tree was used to look for an association between the physical and geographical characteristics of a crater and its morphology. For some of the more common formations, the tree identifies a good share of the craters correctly. How well it does varies with the kind of formation, and some characteristics are shared by several formations.

The Data Set

With all the talk about Mars, in both science and fiction (Ridley Scott’s movie The Martian, based on the book of the same name by Andy Weir, was released in early October 2015), the data set chosen for this assignment is the Mars Craters data set.

The Mars Craters Study presents a global database of 378,540 Mars craters with a diameter of 1 km or larger, created between 4.2 and 3.8 billion years ago during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).

The data set was made available by Wesleyan University/Coursera as part of the Data Analysis and Interpretation Specialisation. It comes from the Ph.D. thesis “Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database” (2011) by Robbins, S.J., University of Colorado at Boulder.

Topic of Interest

The data set provides a catalogue of craters on Mars. The initial idea is to look for patterns that could point to specific major events with a significant impact on Mars’ geology, climate and life as a planetary body.

Codebook

As the initial data set has only nine variables, they could all be relevant to formulating hypotheses and reaching a conclusion, so all the variables will be kept for this assignment.

Variables

  • CRATER_ID: crater ID for internal use, based upon the region of the planet (\({1 \over 16}\)), the “pass” under which the crater was identified, and the order in which it was identified
  • LATITUDE_CIRCLE_IMAGE: latitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees North)
  • LONGITUDE_CIRCLE_IMAGE: longitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees East)
  • DIAM_CIRCLE_IMAGE: diameter from a non-linear least squares circle fit to the vertices selected to manually identify the crater rim (units are km)
  • DEPTH_RIMFLOOR_TOPOG: average elevation of each of the manually determined N points along (or inside) the crater rim (units are km)
    • Depth Rim: Points are selected as relative topographic highs under the assumption they are the least eroded so most original points along the rim
    • Depth Floor: Points were chosen as the lowest elevation that did not include visible embedded craters
  • MORPHOLOGY_EJECTA_1: ejecta morphology classified.
    • If there are multiple values, separated by a “/”, then the order is the inner-most ejecta through the outer-most, or the top-most through the bottom-most
  • MORPHOLOGY_EJECTA_2: the morphology of the layer(s) itself/themselves. This classification system is unique to this work.
  • MORPHOLOGY_EJECTA_3: overall texture and/or shape of some of the layer(s)/ejecta that are generally unique and deserve separate morphological classification.
  • NUMBER_LAYERS: the maximum number of cohesive layers in any azimuthal direction that could be reliably identified

Extra Variables

  • MorphoE1_RD is a categorical variable that has the value “Yes” if the primary morphology is classified as “Rd” (Radial), and “No” otherwise.

  • Quadrangle is a variable derived from both the LATITUDE_CIRCLE_IMAGE and LONGITUDE_CIRCLE_IMAGE variables (see the definition from Wikipedia below).

Each quadrangle holds approximately one to five percent of the recorded craters. MC-16: Memnonia has the most observations (20455 = 5.32%) and MC-10: Lunae Palus has the fewest (3478 = 0.90%).
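The two percentages quoted above follow directly from the counts. The sketch below is in Python purely for illustration (the assignment itself was run in SAS); the helper name `share_percent` is hypothetical, and the total is assumed to be the 384343 observations reported in the SAS output.

```python
def share_percent(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)

TOTAL_CRATERS = 384343  # "Number of Observations Used" in the SAS output

# Quadrangle with the most and with the fewest recorded craters.
memnonia = share_percent(20455, TOTAL_CRATERS)
lunae_palus = share_percent(3478, TOTAL_CRATERS)
print(memnonia, lunae_palus)
```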

List of quadrangles on Mars (Wikipedia):

The surface of Mars has been divided into 30 quadrangles by the United States Geological Survey, so named because their borders lie along lines of latitude and longitude and so maps appear rectangular. Martian quadrangles are named after local features and are numbered with the prefix “MC” for “Mars Chart”. West longitude is used.

The following imagemap of the planet Mars is divided into 30 linked quadrangles. North is at the top; 0°N 180°W is at the far left on the equator. The map images were taken by the Mars Global Surveyor.

Running a Classification Tree

Is the morphology of a crater strongly associated with its physical and geographical characteristics?

The Test

The variables Quadrangle, DIAM_CIRCLE_IMAGE, DEPTH_RIMFLOOR_TOPOG and NUMBER_LAYERS were used to classify the main characteristic of MORPHOLOGY_EJECTA_1 as “Rd” (Radial) or not (coded as “Yes” or “No” in the variable MorphoE1_RD).
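As an illustration of how the target is coded, the sketch below derives the “Yes”/“No” flag from a MORPHOLOGY_EJECTA_1 value by keeping only the part before the first “/”. It is written in Python for illustration and only loosely mirrors the SAS data step shown later; the function name `is_radial` and the sample values in the usage lines are hypothetical.

```python
def is_radial(morphology_ejecta_1):
    """Return "Yes" if the first (inner-most) ejecta morphology is "Rd", else "No"."""
    primary = morphology_ejecta_1.split("/")[0]  # keep the part before the first "/"
    primary = primary.strip().upper()            # normalise whitespace and case
    return "Yes" if primary == "RD" else "No"

# Hypothetical sample values, for illustration only.
for value in ["Rd", "Rd/SLEPS", "SLERS", ""]:
    print(repr(value), "->", is_radial(value))
```

Unlike the SAS TRIM function, `strip()` also removes leading blanks, so the two versions could differ on values that start with a space.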

The parameter Grow was set to Entropy to control the growth of the tree and the parameter Prune was set to Cost-Complexity.
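To make the entropy criterion concrete, the sketch below computes the Shannon entropy of a node from its class counts, and the entropy reduction achieved by a split. This is the textbook definition, written in Python for illustration; it is not a description of how PROC HPSPLIT computes it internally, and the function names are hypothetical. The root counts are the row totals of the confusion matrix in the SAS output.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a node, given the class counts in that node."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:  # empty classes contribute nothing
            p = c / total
            result -= p * math.log2(p)
    return result

def information_gain(parent, children):
    """Entropy reduction obtained by splitting `parent` into the `children` nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# "No" and "Yes" craters at the root: 352829 + 4445 and 15679 + 11390.
root = [357274, 27069]
print(round(entropy(root), 3))
```

A split is attractive when the weighted entropy of its child nodes is much lower than the entropy of the parent, which is what `information_gain` measures.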

The Results

Using a seed of 6587, cross-validation selected a tree with 171 leaves and a minimum Average Squared Error (ASE) of 0.0413. The pruned tree reported in the SAS output has 177 leaves and an ASE of 0.0404.

The confusion matrix shows that the model correctly predicts that a crater is not Radial in 98.8% of cases (1 minus an error rate of 0.0124). It is much less effective at identifying Radial craters: only 42.1% of them are classified correctly (1 minus an error rate of 0.5792).
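For reference, cost-complexity pruning in the classic CART formulation scores each candidate subtree by trading its error against its size. The expression below is that textbook criterion; the notation is an assumption, and the post does not show how PROC HPSPLIT applies it internally.

```latex
\[
R_\alpha(T) = R(T) + \alpha \, |\tilde{T}|
\]
```

Here \(R(T)\) is the error of the subtree \(T\), \(|\tilde{T}|\) is its number of leaves, and \(\alpha \ge 0\) is the complexity parameter. Larger values of \(\alpha\) favour smaller trees, and cross-validation is used to choose the value, and therefore the number of leaves, with the lowest estimated error.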
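The rates quoted here can be recomputed from the confusion matrix in the SAS output. The sketch below is in Python for illustration; the function names are hypothetical, and the counts are copied from the output.

```python
# Confusion matrix from the SAS output: outer keys are actual classes,
# inner keys are predicted classes.
matrix = {
    "No":  {"No": 352829, "Yes": 4445},
    "Yes": {"No": 15679, "Yes": 11390},
}

def error_rate(matrix, actual):
    """Share of craters of class `actual` that were predicted as another class."""
    row = matrix[actual]
    wrong = sum(n for predicted, n in row.items() if predicted != actual)
    return wrong / sum(row.values())

def misclassification_rate(matrix):
    """Share of all craters whose predicted class differs from the actual one."""
    total = sum(sum(row.values()) for row in matrix.values())
    wrong = sum(n for actual, row in matrix.items()
                for predicted, n in row.items() if predicted != actual)
    return wrong / total

error_no = error_rate(matrix, "No")
error_yes = error_rate(matrix, "Yes")
overall = misclassification_rate(matrix)
print(round(error_no, 4), round(error_yes, 4), round(overall, 4))
```

Because the model event level is “No”, one minus the “No” error rate matches the Sensitivity reported in the fit statistics, and one minus the “Yes” error rate matches the Specificity.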

The Area Under the Curve (AUC) was 0.8985 and the Gini was 0.0808 for the training data set.

The most important variable for the classification is DEPTH_RIMFLOOR_TOPOG, which is also the first variable used to split the tree, at its root, at a value of 0.010.
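The ranking can be read from the Variable Importance table in the SAS output. The sketch below, in Python for illustration, assumes that the relative column is each variable's training importance divided by the largest one. That assumption reproduces the reported values to about three decimals (the top importance is printed rounded, as 114.4). The function name is hypothetical.

```python
# Training importance per variable, copied from the Variable Importance table.
importance = {
    "DEPTH_RIMFLOOR_TOPOG": 114.4,
    "NUMBER_LAYERS": 55.0158,
    "DIAM_CIRCLE_IMAGE": 42.1979,
    "Quadrangle": 37.0808,
}

def relative_importance(importance):
    """Scale every importance by the largest one, so the top variable is 1.0."""
    top = max(importance.values())
    return {name: value / top for name, value in importance.items()}

relative = relative_importance(importance)
ranking = sorted(relative, key=relative.get, reverse=True)
for name in ranking:
    print(name, round(relative[name], 3))
```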

Using SAS

SAS Output

Results: W01-Running a Classification Tree.sas

Mars' Craters - Decision Trees - Morphology = RD - Full Dataset

The HPSPLIT Procedure

Performance Information

Execution Mode      Single-Machine
Number of Threads   2

Data Access Information

Data        Engine   Role    Path
WORK.WORK   V9       Input   On Client

Model Information

Split Criterion Used              Entropy
Pruning Method                    Cost-Complexity
Subtree Evaluation Criterion      Cost-Complexity
Number of Branches                2
Maximum Tree Depth Requested      10
Maximum Tree Depth Achieved       10
Tree Depth                        10
Number of Leaves Before Pruning   505
Number of Leaves After Pruning    177
Model Event Level                 No

Observation Information

Number of Observations Read 384343
Number of Observations Used 384343

Cross Validation

Cost-Complexity

Plot of Cross Validation Average ASE, including error estimates, while varying pruning parameters for MorphoE1_RD

Tree Plots

Overview

Tree Overview Plot for MorphoE1_RD

Subtree, Starting at Node=0

Subtree Detail Plot for MorphoE1_RD starting at node 0 down to depth 3

Model Assessment

Confusion Matrix

Model-Based Confusion Matrix
Actual   Predicted No   Predicted Yes   Error Rate
No             352829            4445       0.0124
Yes             15679           11390       0.5792

Fit Statistics

Model-Based Fit Statistics for Selected Tree
N Leaves   ASE      Misclass   Sensitivity   Specificity   Entropy   Gini     RSS       AUC
177        0.0404   0.0524     0.9876        0.4208        0.2153    0.0808   31046.1   0.8985

ROC Plot

Receiver Operating Characteristic (ROC) Curve for MorphoE1_RD

Variable Importance

Variable               Variable Label   Training Relative   Training Importance   Count
DEPTH_RIMFLOOR_TOPOG   Depth                       1.0000                 114.4      50
NUMBER_LAYERS          Layers                      0.4807               55.0158      21
DIAM_CIRCLE_IMAGE      Diameter                    0.3687               42.1979      25
Quadrangle             Quadrangle                  0.3240               37.0808      80

SAS Code

/* Use the course's library */
LIBNAME mydata "/courses/d1406ae5ba27fe300" ACCESS = readonly;

DATA WORK;
  SET mydata.marscrater_pds;

  /* Declare the length up front: otherwise the first assignment (31 characters)
     sets it and the longer "MC-30: Mare Australe (South Pole)" is truncated */
  LENGTH Quadrangle $ 33;

  /* Collapse the Morphology of Ejecta 1 to its main feature, to reduce the output */
  IF (INDEX(MORPHOLOGY_EJECTA_1, "/") = 0)
    THEN MorphoE1 = MORPHOLOGY_EJECTA_1;
    ELSE MorphoE1 = SUBSTR(MORPHOLOGY_EJECTA_1, 1, INDEX(MORPHOLOGY_EJECTA_1, "/") - 1);
  MorphoE1 = UPCASE(TRIM(MorphoE1));

  /* Is the main feature of Morphology 1 equal to "RD"? */
  IF MorphoE1 = "RD"
    THEN MorphoE1_RD = "Yes";
    ELSE MorphoE1_RD = "No";

  /* Convert coordinates to quadrangles: https://en.wikipedia.org/wiki/List_of_quadrangles_on_Mars */
  /* LO shifts the East longitude by 180 degrees into the 0-360 range. LO increases
     eastward while the Wikipedia table lists West longitude, so the name attached
     to each range is worth double-checking */
  LA = LATITUDE_CIRCLE_IMAGE;
  LO = LONGITUDE_CIRCLE_IMAGE + 180;
  IF LA >=  65 AND LA <=  90 AND LO >=   0 AND LO <= 360 THEN Quadrangle = "MC-01: Mare Boreum (North Pole)";
  IF LA >=  30 AND LA <   65 AND LO >= 120 AND LO <  180 THEN Quadrangle = "MC-02: Diacria";
  IF LA >=  30 AND LA <   65 AND LO >=  60 AND LO <  120 THEN Quadrangle = "MC-03: Arcadia";
  IF LA >=  30 AND LA <   65 AND LO >=   0 AND LO <   60 THEN Quadrangle = "MC-04: Mare Acidalium";
  IF LA >=  30 AND LA <   65 AND LO >= 300 AND LO <= 360 THEN Quadrangle = "MC-05: Ismenius Lacus";
  IF LA >=  30 AND LA <   65 AND LO >= 240 AND LO <  300 THEN Quadrangle = "MC-06: Casius";
  IF LA >=  30 AND LA <   65 AND LO >= 180 AND LO <  240 THEN Quadrangle = "MC-07: Cebrenia";
  IF LA >=   0 AND LA <   30 AND LO >= 135 AND LO <  180 THEN Quadrangle = "MC-08: Amazonis";
  IF LA >=   0 AND LA <   30 AND LO >=  90 AND LO <  135 THEN Quadrangle = "MC-09: Tharsis";
  IF LA >=   0 AND LA <   30 AND LO >=  45 AND LO <   90 THEN Quadrangle = "MC-10: Lunae Palus";
  IF LA >=   0 AND LA <   30 AND LO >=   0 AND LO <   45 THEN Quadrangle = "MC-11: Oxia Palus";
  IF LA >=   0 AND LA <   30 AND LO >= 315 AND LO <= 360 THEN Quadrangle = "MC-12: Arabia";
  IF LA >=   0 AND LA <   30 AND LO >= 270 AND LO <  315 THEN Quadrangle = "MC-13: Syrtis Major";
  IF LA >=   0 AND LA <   30 AND LO >= 225 AND LO <  270 THEN Quadrangle = "MC-14: Amenthes";
  IF LA >=   0 AND LA <   30 AND LO >= 180 AND LO <  225 THEN Quadrangle = "MC-15: Elysium";
  IF LA >= -30 AND LA <    0 AND LO >= 135 AND LO <  180 THEN Quadrangle = "MC-16: Memnonia";
  IF LA >= -30 AND LA <    0 AND LO >=  90 AND LO <  135 THEN Quadrangle = "MC-17: Phoenicis Lacus";
  IF LA >= -30 AND LA <    0 AND LO >=  45 AND LO <   90 THEN Quadrangle = "MC-18: Coprates";
  IF LA >= -30 AND LA <    0 AND LO >=   0 AND LO <   45 THEN Quadrangle = "MC-19: Margaritifer Sinus";
  IF LA >= -30 AND LA <    0 AND LO >= 315 AND LO <= 360 THEN Quadrangle = "MC-20: Sinus Sabaeus";
  IF LA >= -30 AND LA <    0 AND LO >= 270 AND LO <  315 THEN Quadrangle = "MC-21: Iapygia";
  IF LA >= -30 AND LA <    0 AND LO >= 225 AND LO <  270 THEN Quadrangle = "MC-22: Mare Tyrrhenum";
  IF LA >= -30 AND LA <    0 AND LO >= 180 AND LO <  225 THEN Quadrangle = "MC-23: Aeolis";
  IF LA >= -65 AND LA <  -30 AND LO >= 120 AND LO <  180 THEN Quadrangle = "MC-24: Phaethontis";
  IF LA >= -65 AND LA <  -30 AND LO >=  60 AND LO <  120 THEN Quadrangle = "MC-25: Thaumasia";
  IF LA >= -65 AND LA <  -30 AND LO >=   0 AND LO <   60 THEN Quadrangle = "MC-26: Argyre";
  IF LA >= -65 AND LA <  -30 AND LO >= 300 AND LO <= 360 THEN Quadrangle = "MC-27: Noachis";
  IF LA >= -65 AND LA <  -30 AND LO >= 240 AND LO <  300 THEN Quadrangle = "MC-28: Hellas";
  IF LA >= -65 AND LA <  -30 AND LO >= 180 AND LO <  240 THEN Quadrangle = "MC-29: Eridania";
  IF LA >= -90 AND LA <  -65 AND LO >=   0 AND LO <= 360 THEN Quadrangle = "MC-30: Mare Australe (South Pole)";

  LABEL Quadrangle             = "Quadrangle"
        DIAM_CIRCLE_IMAGE      = "Diameter"
        DEPTH_RIMFLOOR_TOPOG   = "Depth"
        MorphoE1_RD            = "Morphology 1-RD"
        NUMBER_LAYERS          = "Layers";

RUN;

ODS GRAPHICS ON;

/* Grow the tree with the entropy criterion and prune it with cost-complexity */
PROC HPSPLIT DATA = WORK SEED = 6587;
  CLASS MorphoE1_RD Quadrangle;
  MODEL MorphoE1_RD = Quadrangle
                      DIAM_CIRCLE_IMAGE
                      DEPTH_RIMFLOOR_TOPOG
                      NUMBER_LAYERS;
  GROW ENTROPY;
  PRUNE COSTCOMPLEXITY;
  TITLE "Mars' Craters - Decision Trees - Morphology = RD - Full Dataset";
RUN;

/* Clear the title */
TITLE;