Course Details

Specialisation: Data Analysis and Interpretation
Course: Machine Learning for Data Analysis
Education Institution: Wesleyan University
Publisher: Coursera
Assignment: Running a Random Forest

Introduction

Besides the historical fascination with Mars due to its proximity and some similarities with Earth, the collection of facts allows scientists to put together a big jigsaw puzzle.

NASA’s recent announcement of evidence of liquid water flowing on the surface of Mars adds to all that is known and to the curiosity, or even the need, to find out much more.

Summary

When a random forest of classification trees is used to associate the physical and geographical characteristics of a crater with its morphology, a good share of the craters can be identified correctly for some particular, more common formations. The quality of the identification varies with the kind of formation, and some characteristics are shared by several formations.

The Data Set

With all the talk about Mars, in both science and fiction (Ridley Scott’s movie The Martian, based on the book of the same name by Andy Weir, was released in early October 2015), the data set chosen for this assignment is Mars Craters.

The Mars Craters Study presents a global database of 378,540 Mars craters with a diameter of 1 km or larger, created between 4.2 and 3.8 billion years ago during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).

The data set was made available by Wesleyan University/Coursera as part of the Data Analysis and Interpretation Specialisation. It comes from the Ph.D. thesis “Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database” (2011) by Robbins, S.J., University of Colorado at Boulder.

Topic of Interest

The data set provides a catalogue of craters on Mars. The initial idea is to look for patterns that could point to specific major events that would have had a significant impact on Mars’ geology, climate and life as a planetary body.

Codebook

As the initial data set has only nine variables, they could all be relevant to formulating hypotheses and reaching a conclusion, so all the variables will be kept for this assignment.

Variables

  • CRATER_ID: crater ID for internal use, based upon the region of the planet (\({1 \over 16}\)), the “pass” under which the crater was identified, and the order in which it was identified
  • LATITUDE_CIRCLE_IMAGE: latitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees North)
  • LONGITUDE_CIRCLE_IMAGE: longitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees East)
  • DIAM_CIRCLE_IMAGE: diameter from a non-linear least squares circle fit to the vertices selected to manually identify the crater rim (units are km)
  • DEPTH_RIMFLOOR_TOPOG: average elevation of each of the manually determined N points along (or inside) the crater rim (units are km)
    • Depth Rim: Points are selected as relative topographic highs, under the assumption that they are the least eroded and therefore the most original points along the rim
    • Depth Floor: Points were chosen as the lowest elevation that did not include visible embedded craters
  • MORPHOLOGY_EJECTA_1: ejecta morphology classified.
    • If there are multiple values, separated by a “/”, then the order is the inner-most ejecta through the outer-most, or the top-most through the bottom-most
  • MORPHOLOGY_EJECTA_2: the morphology of the layer(s) itself/themselves. This classification system is unique to this work.
  • MORPHOLOGY_EJECTA_3: overall texture and/or shape of some of the layer(s)/ejecta that are generally unique and deserve separate morphological classification.
  • NUMBER_LAYERS: the maximum number of cohesive layers in any azimuthal direction that could be reliably identified

Extra Variables

  • MorphoE1_RD is a categorical variable that has the value “Yes” if the primary morphology is classified as “Rd” (Radial), and “No” otherwise.

  • Quadrangle is a variable derived from both the LATITUDE_CIRCLE_IMAGE and LONGITUDE_CIRCLE_IMAGE variables (see the definition from Wikipedia below).

Each Quadrangle holds approximately one to five percent of the recorded craters. MC-16: Memnonia has the most observations (20455 = 5.32%) and MC-10: Lunae Palus the fewest (3478 = 0.90%).

List of quadrangles on Mars (Wikipedia):

The surface of Mars has been divided into 30 quadrangles by the United States Geological Survey, so named because their borders lie along lines of latitude and longitude and so maps appear rectangular. Martian quadrangles are named after local features and are numbered with the prefix “MC” for “Mars Chart”. West longitude is used.

The following imagemap of the planet Mars is divided into 30 linked quadrangles. North is at the top; 0°N 180°W is at the far left on the equator. The map images were taken by the Mars Global Surveyor.
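The band layout described above (two polar caps, two mid-latitude bands of six quadrangles and two equatorial bands of eight) can also be written as arithmetic rather than as a long chain of conditions. The following is only a sketch in Python, not the SAS used in this assignment: the function name `quadrangle_number` is hypothetical, and `lon` is assumed to be already expressed on the 0 to 360 scale of the quadrangle table (West longitude, per the Wikipedia text).

```python
def quadrangle_number(lat, lon):
    """Return the MC number (1-30) for a latitude and a 0-360 longitude.

    Assumption: `lon` already uses the 0-360 scale of the quadrangle table.
    """
    if not (-90 <= lat <= 90 and 0 <= lon <= 360):
        raise ValueError("coordinates out of range")
    if lat >= 65:
        return 1    # north polar cap
    if lat < -65:
        return 30   # south polar cap

    # Pick the band: first MC number in the band and quadrangle width in degrees.
    if lat >= 30:
        first, width = 2, 60
    elif lat >= 0:
        first, width = 8, 45
    elif lat >= -30:
        first, width = 16, 45
    else:
        first, width = 24, 60

    count = 360 // width                    # 6 or 8 quadrangles in the band
    k = min(int(lon // width), count - 1)   # lon == 360 falls in the last slot
    # Numbering starts at the slot ending at 180 and runs towards 0,
    # then wraps around from 360 back down to 180.
    return first + (count // 2 - 1 - k) % count
```

For instance, `quadrangle_number(10, 100)` falls in the 0 to 30 latitude band and the 90 to 135 longitude slot, which is MC-09: Tharsis in the table used in the SAS code.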

Running a Random Forest

Is the morphology of a crater strongly associated with its physical and geographical characteristics?

The Test

The variables Quadrangle, DIAM_CIRCLE_IMAGE, DEPTH_RIMFLOOR_TOPOG and NUMBER_LAYERS were used to classify the main characteristic of MORPHOLOGY_EJECTA_1 as “Rd” (Radial) or not (coded as “Yes” or “No” in the variable MorphoE1_RD).
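The coding of the target can be sketched in a few lines of Python (the assignment itself does this in a SAS data step). The function name `radial_flag` is hypothetical and the sample strings are illustrative values only: the first entry before any “/” is taken as the main feature and compared with “RD”, ignoring case.

```python
def radial_flag(morphology_ejecta_1):
    """Return "Yes" when the main (first) ejecta morphology is Rd, else "No"."""
    # Keep only the part before the first "/", i.e. the primary morphology.
    main_feature = morphology_ejecta_1.split("/", 1)[0]
    return "Yes" if main_feature.strip().upper() == "RD" else "No"


# Illustrative values, not rows taken from the data set.
flags = [radial_flag(value) for value in ["Rd", "Rd/MLERS", "SLEPS/Rd"]]
```

Only the first two sample values are flagged as radial, because in the third one “Rd” is not the primary morphology.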

The Results

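The number of valid observations was 44625. Using a seed of 6587, the baseline Misclassification Rate (before any tree is grown) was 39.3%, or a Success Rate of 60.7%. With the full forest of 100 trees, the Out Of Bag Misclassification Rate in the Fit Statistics Table fell to 4.95%, a Success Rate of about 95%.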
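The 39.3% figure matches the Baseline Fit Statistics table in the SAS output. As a sketch, and assuming the baseline is a constant prediction equal to the class proportions, the three baseline statistics can be reproduced in Python from that single proportion (the function name `baseline_fit_statistics` is hypothetical):

```python
import math


def baseline_fit_statistics(p_minority):
    """Fit statistics of a constant prediction equal to the class proportions."""
    p = p_minority
    return {
        # Squared error of predicting p for every observation.
        "average_square_error": p * (1 - p),
        # Always predicting the majority class misses the minority share.
        "misclassification_rate": min(p, 1 - p),
        # Binary cross-entropy of the class proportions.
        "log_loss": -(p * math.log(p) + (1 - p) * math.log(1 - p)),
    }


stats = baseline_fit_statistics(0.393)
```

Rounded to three decimals, the result agrees with the Average Square Error (0.239) and Log Loss (0.670) reported in that table, which supports reading 39.3% as the rate obtained without any model.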

From the Fit Statistics Table, the Out Of Bag Misclassification Rate flattens out at around 55 trees and stays close to 5% from there on.

NUMBER_LAYERS is by far the most important variable for the classification, with a Gini value of 0.3803. The remaining variables contribute much less, with Gini values no higher than 0.0341.
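For readers unfamiliar with the measure, the Gini importance of a variable is built from the reduction in Gini impurity achieved by the splits that use it. The Python sketch below shows that reduction for a single split; the counts are made up for illustration, and the way HPFOREST aggregates and normalises these reductions over the forest is not reproduced here.

```python
def gini_impurity(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_reduction(parent, children):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * gini_impurity(child) for child in children)
    return gini_impurity(parent) - weighted


# Hypothetical node with 40 "Yes" and 60 "No" craters, split into two children.
reduction = gini_reduction([40, 60], [[35, 5], [5, 55]])
```

A split that separates the classes well, as in this toy node, removes most of the parent impurity; a variable whose splits rarely do so ends up with a Gini value near zero.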

Using SAS

SAS Output

Results: W02-Running a Random Forest.sas

Mars' Craters - Random Forest - Morphology = RD - Full Dataset

The HPFOREST Procedure

Performance Information

Execution Mode      Single-Machine
Number of Threads   2

Data Access Information

Data       Engine  Role   Path
WORK.WORK  V9      Input  On Client

Model Information

Parameter                 Value
Variables to Try          2        (Default)
Maximum Trees             100      (Default)
Inbag Fraction            0.6      (Default)
Prune Fraction            0        (Default)
Prune Threshold           0.1      (Default)
Leaf Fraction             0.00001  (Default)
Leaf Size Setting         1        (Default)
Leaf Size Used            1
Category Bins             30       (Default)
Interval Bins             100
Minimum Category Size     5        (Default)
Node Size                 100000   (Default)
Maximum Depth             20       (Default)
Alpha                     1        (Default)
Exhaustive                5000     (Default)
Rows of Sequence to Skip  5        (Default)
Split Criterion           .        Gini
Preselection Method       .        Loh
Missing Value Handling    .        Valid value

Number of Observations

Type                         N
Number of Observations Read  44625
Number of Observations Used  44625

Baseline Fit Statistics

Statistic               Value
Average Square Error    0.239
Misclassification Rate  0.393
Log Loss                0.670

Fit Statistics

Columns, left to right: Number of Trees, Number of Leaves, Average Square Error (Train), Average Square Error (OOB), Misclassification Rate (Train), Misclassification Rate (OOB), Log Loss (Train), Log Loss (OOB).

1 1416 0.0332 0.0602 0.0380 0.0631 0.4765 1.114
2 2753 0.0252 0.0571 0.0334 0.0613 0.1500 0.955
3 4207 0.0221 0.0551 0.0299 0.0603 0.0908 0.843
4 5500 0.0209 0.0536 0.0282 0.0605 0.0770 0.740
5 6999 0.0199 0.0515 0.0272 0.0588 0.0694 0.654
6 8251 0.0195 0.0497 0.0268 0.0579 0.0683 0.567
7 9642 0.0192 0.0486 0.0267 0.0574 0.0672 0.500
8 11087 0.0188 0.0476 0.0271 0.0563 0.0664 0.448
9 12626 0.0185 0.0468 0.0262 0.0557 0.0656 0.413
10 13979 0.0183 0.0463 0.0258 0.0557 0.0653 0.382
11 15387 0.0182 0.0457 0.0261 0.0554 0.0650 0.354
12 16826 0.0180 0.0453 0.0258 0.0550 0.0647 0.329
13 18297 0.0179 0.0447 0.0258 0.0544 0.0646 0.303
14 19459 0.0181 0.0443 0.0258 0.0539 0.0650 0.277
15 20844 0.0180 0.0441 0.0262 0.0536 0.0649 0.266
16 22289 0.0179 0.0438 0.0257 0.0533 0.0647 0.252
17 23689 0.0179 0.0436 0.0260 0.0533 0.0647 0.244
18 24937 0.0179 0.0433 0.0260 0.0531 0.0647 0.234
19 26271 0.0178 0.0430 0.0261 0.0529 0.0646 0.223
20 27532 0.0178 0.0429 0.0262 0.0528 0.0647 0.214
21 28887 0.0178 0.0427 0.0263 0.0526 0.0647 0.207
22 30227 0.0178 0.0426 0.0262 0.0525 0.0646 0.201
23 31685 0.0177 0.0424 0.0260 0.0521 0.0645 0.196
24 33151 0.0176 0.0423 0.0263 0.0523 0.0643 0.192
25 34606 0.0175 0.0422 0.0262 0.0522 0.0642 0.189
26 35996 0.0175 0.0421 0.0261 0.0520 0.0641 0.186
27 37452 0.0174 0.0420 0.0261 0.0515 0.0640 0.185
28 38797 0.0174 0.0418 0.0264 0.0515 0.0639 0.182
29 40193 0.0174 0.0417 0.0261 0.0513 0.0639 0.178
30 41461 0.0174 0.0416 0.0260 0.0513 0.0640 0.175
31 42961 0.0173 0.0415 0.0259 0.0511 0.0638 0.174
32 44415 0.0173 0.0415 0.0261 0.0509 0.0638 0.174
33 45833 0.0173 0.0414 0.0259 0.0510 0.0637 0.172
34 47055 0.0173 0.0413 0.0259 0.0508 0.0638 0.169
35 48521 0.0173 0.0413 0.0261 0.0510 0.0637 0.168
36 49946 0.0173 0.0413 0.0262 0.0512 0.0636 0.167
37 51352 0.0172 0.0412 0.0259 0.0511 0.0636 0.165
38 52598 0.0173 0.0412 0.0264 0.0509 0.0637 0.164
39 54005 0.0173 0.0412 0.0261 0.0509 0.0636 0.163
40 55372 0.0172 0.0411 0.0262 0.0509 0.0636 0.162
41 56815 0.0172 0.0411 0.0261 0.0507 0.0636 0.162
42 58209 0.0172 0.0410 0.0262 0.0508 0.0635 0.161
43 59521 0.0172 0.0410 0.0261 0.0507 0.0635 0.159
44 61023 0.0171 0.0410 0.0261 0.0507 0.0634 0.158
45 62456 0.0171 0.0409 0.0259 0.0507 0.0634 0.155
46 63746 0.0172 0.0409 0.0259 0.0507 0.0634 0.154
47 65133 0.0171 0.0408 0.0259 0.0505 0.0634 0.153
48 66517 0.0171 0.0408 0.0258 0.0506 0.0634 0.153
49 68010 0.0171 0.0408 0.0259 0.0506 0.0634 0.153
50 69318 0.0171 0.0408 0.0259 0.0505 0.0634 0.153
51 70666 0.0171 0.0407 0.0258 0.0505 0.0634 0.152
52 71933 0.0171 0.0407 0.0261 0.0503 0.0634 0.151
53 73318 0.0171 0.0407 0.0258 0.0503 0.0634 0.150
54 74511 0.0172 0.0407 0.0259 0.0501 0.0635 0.150
55 75875 0.0172 0.0407 0.0261 0.0502 0.0635 0.150
56 77296 0.0172 0.0407 0.0262 0.0498 0.0635 0.150
57 78805 0.0171 0.0407 0.0262 0.0500 0.0635 0.149
58 80183 0.0171 0.0406 0.0262 0.0502 0.0634 0.149
59 81623 0.0171 0.0406 0.0262 0.0502 0.0634 0.149
60 83060 0.0171 0.0406 0.0262 0.0501 0.0634 0.149
61 84451 0.0171 0.0406 0.0263 0.0501 0.0634 0.149
62 85821 0.0171 0.0406 0.0263 0.0500 0.0634 0.148
63 87227 0.0171 0.0406 0.0263 0.0499 0.0634 0.148
64 88676 0.0171 0.0405 0.0261 0.0500 0.0634 0.148
65 90043 0.0171 0.0405 0.0261 0.0499 0.0634 0.147
66 91297 0.0171 0.0405 0.0262 0.0500 0.0634 0.147
67 92713 0.0171 0.0405 0.0260 0.0498 0.0633 0.147
68 94120 0.0171 0.0404 0.0261 0.0497 0.0633 0.147
69 95453 0.0171 0.0404 0.0260 0.0497 0.0634 0.147
70 96707 0.0171 0.0405 0.0261 0.0496 0.0634 0.147
71 98029 0.0171 0.0404 0.0259 0.0496 0.0634 0.147
72 99532 0.0171 0.0404 0.0259 0.0496 0.0634 0.147
73 100836 0.0171 0.0404 0.0261 0.0498 0.0634 0.146
74 102162 0.0171 0.0404 0.0261 0.0496 0.0634 0.146
75 103619 0.0171 0.0404 0.0261 0.0497 0.0634 0.146
76 104964 0.0171 0.0404 0.0262 0.0496 0.0634 0.146
77 106365 0.0171 0.0404 0.0263 0.0498 0.0634 0.146
78 107697 0.0171 0.0404 0.0263 0.0499 0.0634 0.146
79 109079 0.0171 0.0404 0.0264 0.0499 0.0634 0.145
80 110558 0.0170 0.0404 0.0263 0.0497 0.0634 0.145
81 112049 0.0170 0.0404 0.0262 0.0497 0.0633 0.145
82 113297 0.0170 0.0403 0.0261 0.0496 0.0633 0.145
83 114748 0.0170 0.0403 0.0262 0.0495 0.0633 0.145
84 116097 0.0170 0.0403 0.0262 0.0495 0.0633 0.145
85 117523 0.0170 0.0403 0.0261 0.0497 0.0633 0.145
86 118913 0.0170 0.0403 0.0262 0.0497 0.0633 0.145
87 120325 0.0170 0.0403 0.0261 0.0495 0.0633 0.145
88 121774 0.0170 0.0403 0.0261 0.0496 0.0633 0.145
89 123193 0.0170 0.0403 0.0261 0.0495 0.0632 0.145
90 124542 0.0170 0.0403 0.0259 0.0497 0.0633 0.145
91 125852 0.0170 0.0403 0.0261 0.0498 0.0633 0.145
92 127229 0.0170 0.0402 0.0260 0.0497 0.0633 0.145
93 128529 0.0170 0.0402 0.0259 0.0496 0.0633 0.145
94 129897 0.0170 0.0402 0.0260 0.0496 0.0633 0.145
95 131210 0.0170 0.0402 0.0260 0.0497 0.0633 0.145
96 132595 0.0170 0.0402 0.0261 0.0495 0.0633 0.144
97 134016 0.0170 0.0402 0.0260 0.0495 0.0633 0.144
98 135293 0.0170 0.0402 0.0261 0.0494 0.0633 0.144
99 136605 0.0170 0.0402 0.0262 0.0495 0.0633 0.144
100 137914 0.0170 0.0402 0.0262 0.0495 0.0633 0.144

Loss Reduction Variable Importance

Variable              Number of Rules  Gini      OOB Gini  Margin    OOB Margin
NUMBER_LAYERS         10734            0.380336   0.37668  0.760671  0.757030
Quadrangle            18501            0.011123   0.00321  0.022245  0.015800
DIAM_CIRCLE_IMAGE     55843            0.034090  -0.01061  0.068181  0.023322
DEPTH_RIMFLOOR_TOPOG  52736            0.020689  -0.01432  0.041378  0.006347

SAS Code

/* Use Course's Library */
LIBNAME mydata "/courses/d1406ae5ba27fe300" ACCESS = readonly;

DATA WORK;
  SET mydata.marscrater_pds;

  /* Keep only the craters that have an ejecta morphology recorded */
  WHERE MORPHOLOGY_EJECTA_1 NE " ";

  /* Collapse the Morphology of Ejecta 1 to its main feature, to reduce the output */
  IF (INDEX(MORPHOLOGY_EJECTA_1, "/") = 0)
    THEN MorphoE1 = MORPHOLOGY_EJECTA_1;
    ELSE MorphoE1 = SUBSTR(MORPHOLOGY_EJECTA_1, 1, INDEX(MORPHOLOGY_EJECTA_1, "/") - 1);
  MorphoE1 = UPCASE(TRIM(MorphoE1));

  /* Is the Morphology 1 equal to "RD"? */
  IF MorphoE1 = "RD"
    THEN MorphoE1_RD = "Yes";
    ELSE MorphoE1_RD = "No";

  /* Declare the length up front so the longest label (MC-30, 33 characters) is not truncated */
  LENGTH Quadrangle $ 33;

  /* Convert coordinates to Quadrangles: https://en.wikipedia.org/wiki/List_of_quadrangles_on_Mars */
  /* LO shifts the catalogue longitude (decimal degrees East) onto a 0-360 scale;
     note that the Wikipedia table expresses its boundaries in West longitude */
  LA = LATITUDE_CIRCLE_IMAGE;
  LO = LONGITUDE_CIRCLE_IMAGE + 180;
  IF LA >=  65 AND LA <=  90 AND LO >=   0 AND LO <= 360 THEN Quadrangle = "MC-01: Mare Boreum (North Pole)";
  IF LA >=  30 AND LA <   65 AND LO >= 120 AND LO <  180 THEN Quadrangle = "MC-02: Diacria";
  IF LA >=  30 AND LA <   65 AND LO >=  60 AND LO <  120 THEN Quadrangle = "MC-03: Arcadia";
  IF LA >=  30 AND LA <   65 AND LO >=   0 AND LO <   60 THEN Quadrangle = "MC-04: Mare Acidalium";
  IF LA >=  30 AND LA <   65 AND LO >= 300 AND LO <= 360 THEN Quadrangle = "MC-05: Ismenius Lacus";
  IF LA >=  30 AND LA <   65 AND LO >= 240 AND LO <  300 THEN Quadrangle = "MC-06: Casius";
  IF LA >=  30 AND LA <   65 AND LO >= 180 AND LO <  240 THEN Quadrangle = "MC-07: Cebrenia";
  IF LA >=   0 AND LA <   30 AND LO >= 135 AND LO <  180 THEN Quadrangle = "MC-08: Amazonis";
  IF LA >=   0 AND LA <   30 AND LO >=  90 AND LO <  135 THEN Quadrangle = "MC-09: Tharsis";
  IF LA >=   0 AND LA <   30 AND LO >=  45 AND LO <   90 THEN Quadrangle = "MC-10: Lunae Palus";
  IF LA >=   0 AND LA <   30 AND LO >=   0 AND LO <   45 THEN Quadrangle = "MC-11: Oxia Palus";
  IF LA >=   0 AND LA <   30 AND LO >= 315 AND LO <= 360 THEN Quadrangle = "MC-12: Arabia";
  IF LA >=   0 AND LA <   30 AND LO >= 270 AND LO <  315 THEN Quadrangle = "MC-13: Syrtis Major";
  IF LA >=   0 AND LA <   30 AND LO >= 225 AND LO <  270 THEN Quadrangle = "MC-14: Amenthes";
  IF LA >=   0 AND LA <   30 AND LO >= 180 AND LO <  225 THEN Quadrangle = "MC-15: Elysium";
  IF LA >= -30 AND LA <    0 AND LO >= 135 AND LO <  180 THEN Quadrangle = "MC-16: Memnonia";
  IF LA >= -30 AND LA <    0 AND LO >=  90 AND LO <  135 THEN Quadrangle = "MC-17: Phoenicis Lacus";
  IF LA >= -30 AND LA <    0 AND LO >=  45 AND LO <   90 THEN Quadrangle = "MC-18: Coprates";
  IF LA >= -30 AND LA <    0 AND LO >=   0 AND LO <   45 THEN Quadrangle = "MC-19: Margaritifer Sinus";
  IF LA >= -30 AND LA <    0 AND LO >= 315 AND LO <= 360 THEN Quadrangle = "MC-20: Sinus Sabaeus";
  IF LA >= -30 AND LA <    0 AND LO >= 270 AND LO <  315 THEN Quadrangle = "MC-21: Iapygia";
  IF LA >= -30 AND LA <    0 AND LO >= 225 AND LO <  270 THEN Quadrangle = "MC-22: Mare Tyrrhenum";
  IF LA >= -30 AND LA <    0 AND LO >= 180 AND LO <  225 THEN Quadrangle = "MC-23: Aeolis";
  IF LA >= -65 AND LA <  -30 AND LO >= 120 AND LO <  180 THEN Quadrangle = "MC-24: Phaethontis";
  IF LA >= -65 AND LA <  -30 AND LO >=  60 AND LO <  120 THEN Quadrangle = "MC-25: Thaumasia";
  IF LA >= -65 AND LA <  -30 AND LO >=   0 AND LO <   60 THEN Quadrangle = "MC-26: Argyre";
  IF LA >= -65 AND LA <  -30 AND LO >= 300 AND LO <= 360 THEN Quadrangle = "MC-27: Noachis";
  IF LA >= -65 AND LA <  -30 AND LO >= 240 AND LO <  300 THEN Quadrangle = "MC-28: Hellas";
  IF LA >= -65 AND LA <  -30 AND LO >= 180 AND LO <  240 THEN Quadrangle = "MC-29: Eridania";
  IF LA >= -90 AND LA <  -65 AND LO >=   0 AND LO <= 360 THEN Quadrangle = "MC-30: Mare Australe (South Pole)";

  LABEL Quadrangle             = "Quadrangle"
        DIAM_CIRCLE_IMAGE      = "Diameter"
        DEPTH_RIMFLOOR_TOPOG   = "Depth"
        MorphoE1_RD            = "Morphology 1-RD"
        NUMBER_LAYERS          = "Layers";

RUN;

ODS GRAPHICS ON;

/* Grow the random forest; the seed makes the run repeatable */
PROC HPFOREST DATA = WORK SEED = 6587;
  /* Quadrangle has 30 levels, so LEVEL = NOMINAL may describe it better;
     BINARY is the setting that produced the output shown */
  INPUT Quadrangle / LEVEL = BINARY;
  INPUT DIAM_CIRCLE_IMAGE
        DEPTH_RIMFLOOR_TOPOG
        NUMBER_LAYERS / LEVEL = INTERVAL;
  TARGET MorphoE1_RD / LEVEL = BINARY;
  TITLE "Mars' Craters - Random Forest - Morphology = RD - Full Dataset";
RUN;

/* Clear the title so it does not carry over to later output */
TITLE;