Course Details

Specialisation Data Analysis and Interpretation
Course Machine Learning for Data Analysis
Education Institution Wesleyan University
Publisher Coursera
Assignment Running a Lasso Regression Analysis

Introduction

Besides the historical fascination with Mars due to its proximity and its similarities with Earth, the collection of facts about the planet allows scientists to put together a big jigsaw puzzle.

A recent NASA announcement of evidence of flowing liquid water on the surface of Mars adds to what is already known and to the curiosity, or even the need, to find out much more.

Summary

When trying to identify an association between the physical and geographical characteristics of a crater and its morphology using a classification tree algorithm, it is possible to identify a good number of craters for some particular, more common formations. The identification varies depending on the kind of formation, and some characteristics are common to several formations.

The Data Set

With all the talk about Mars, in both science and fiction (Ridley Scott’s movie The Martian, based on the book of the same name by Andy Weir, was released in early October 2015), the data set chosen for this assignment is Mars Craters.

The Mars Craters Study presents a global database that includes 378,540 Mars craters with a diameter of 1 km or larger that were created between 4.2 and 3.8 billion years ago, during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).

The data set was made available by Wesleyan University/Coursera as part of the Data Analysis and Interpretation Specialisation. It comes from the Ph.D. thesis “Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database” (2011) by Robbins, S.J., University of Colorado at Boulder.

Topic of Interest

The data set provides a catalogue of craters on Mars. The initial idea is to check for patterns that could identify specific major events that might have happened and that would have had a significant impact on Mars’ geology, climate and life as a planetary body.

Codebook

As the initial data set has only nine variables, they could all be relevant to formulating hypotheses and reaching a conclusion, so all the variables will be kept for this assignment.

Variables

  • CRATER_ID: crater ID for internal use, based upon the region of the planet (\({1 \over 16}\)), the “pass” under which the crater was identified, and the order in which it was identified
  • LATITUDE_CIRCLE_IMAGE: latitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees North)
  • LONGITUDE_CIRCLE_IMAGE: longitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees East)
  • DIAM_CIRCLE_IMAGE: diameter from a non-linear least squares circle fit to the vertices selected to manually identify the crater rim (units are km)
  • DEPTH_RIMFLOOR_TOPOG: average elevation of each of the manually determined N points along (or inside) the crater rim (units are km)
    • Depth Rim: Points are selected as relative topographic highs, under the assumption that they are the least eroded and therefore the most original points along the rim
    • Depth Floor: Points were chosen as the lowest elevation that did not include visible embedded craters
  • MORPHOLOGY_EJECTA_1: ejecta morphology classified.
    • If there are multiple values, separated by a “/”, then the order is the inner-most ejecta through the outer-most, or the top-most through the bottom-most
  • MORPHOLOGY_EJECTA_2: the morphology of the layer(s) itself/themselves. This classification system is unique to this work.
  • MORPHOLOGY_EJECTA_3: overall texture and/or shape of some of the layer(s)/ejecta that are generally unique and deserve separate morphological classification.
  • NUMBER_LAYERS: the maximum number of cohesive layers in any azimuthal direction that could be reliably identified

Extra Variables

  • MorphoE1_RD is a categorical variable that has value 1 (“Yes”) if the primary morphology is classified as “Rd” (Radial), and 0 (“No”) otherwise.

  • Quadrangle is a variable derived from both LATITUDE_CIRCLE_IMAGE and LONGITUDE_CIRCLE_IMAGE variables. (see below a definition from Wikipedia)

Each Quadrangle holds approximately one to five percent of the recorded craters. MC-16: Memnonia has the most observations (20455 = 5.32%) and MC-10: Lunae Palus has the fewest (3478 = 0.90%).

List of quadrangles on Mars (Wikipedia):

The surface of Mars has been divided into 30 quadrangles by the United States Geological Survey, so named because their borders lie along lines of latitude and longitude and so maps appear rectangular. Martian quadrangles are named after local features and are numbered with the prefix “MC” for “Mars Chart”. West longitude is used.

The following imagemap of the planet Mars is divided into 30 linked quadrangles. North is at the top; 0°N 180°W is at the far left on the equator. The map images were taken by the Mars Global Surveyor.
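The derivation of Quadrangle from the two coordinates can be sketched in Python as below. This is only an illustration and not part of the original SAS program: the function name `quadrangle` and the `BANDS` table are assumptions made here, while the latitude bands, the `+ 180` shift of the longitude and the quadrangle names are copied from the IF chain in the SAS code at the end of this post.

```python
# Latitude bands between the two polar caps: (lat_min, lat_max, cell width in
# degrees of shifted longitude, names ordered by increasing shifted longitude).
BANDS = [
    (30, 65, 60, ["MC-04: Mare Acidalium", "MC-03: Arcadia", "MC-02: Diacria",
                  "MC-07: Cebrenia", "MC-06: Casius", "MC-05: Ismenius Lacus"]),
    (0, 30, 45, ["MC-11: Oxia Palus", "MC-10: Lunae Palus", "MC-09: Tharsis",
                 "MC-08: Amazonis", "MC-15: Elysium", "MC-14: Amenthes",
                 "MC-13: Syrtis Major", "MC-12: Arabia"]),
    (-30, 0, 45, ["MC-19: Margaritifer Sinus", "MC-18: Coprates",
                  "MC-17: Phoenicis Lacus", "MC-16: Memnonia", "MC-23: Aeolis",
                  "MC-22: Mare Tyrrhenum", "MC-21: Iapygia", "MC-20: Sinus Sabaeus"]),
    (-65, -30, 60, ["MC-26: Argyre", "MC-25: Thaumasia", "MC-24: Phaethontis",
                    "MC-29: Eridania", "MC-28: Hellas", "MC-27: Noachis"]),
]


def quadrangle(lat, lon_east):
    """Return the quadrangle label for a crater centre, or None if out of range."""
    lo = lon_east + 180.0  # same shift as LO in the SAS program
    if not (-90 <= lat <= 90 and 0 <= lo <= 360):
        return None  # the SAS code leaves Quadrangle blank in this case
    if lat >= 65:
        return "MC-01: Mare Boreum (North Pole)"
    if lat < -65:
        return "MC-30: Mare Australe (South Pole)"
    for lat_min, lat_max, width, names in BANDS:
        if lat_min <= lat < lat_max:
            # lo == 360 belongs to the last cell, as in the "<= 360" conditions
            return names[min(int(lo // width), len(names) - 1)]
    return None


print(quadrangle(-10.0, 0.0))
```

A table of bands keeps the 30 conditions in one place, which makes it easier to compare against the Wikipedia list than a long chain of IF statements.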

Running a Lasso Regression Analysis

Is the morphology of a crater strongly associated with its physical and geographical characteristics?

The Test

The variables Quadrangle (Categorical), DIAM_CIRCLE_IMAGE, DEPTH_RIMFLOOR_TOPOG and NUMBER_LAYERS (Quantitative) were used to classify the main characteristic of MORPHOLOGY_EJECTA_1 as “Rd” (Radial) or not (coded as 1-Yes or 0-No in the variable MorphoE1_RD).
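The coding of the response can be sketched as follows. This is a Python illustration of the rule used in the SAS DATA step (keep the part of MORPHOLOGY_EJECTA_1 before the first “/”, upper-case it, compare with “RD”); the function name `morpho_e1_rd` and the sample values are assumptions made for the example.

```python
def morpho_e1_rd(morphology_ejecta_1):
    """Return 1 if the main (first) ejecta morphology is Radial, else 0."""
    # Keep only the text before the first "/", drop trailing blanks, upper-case
    main = morphology_ejecta_1.split("/", 1)[0].rstrip().upper()
    return 1 if main == "RD" else 0


# Illustrative values only, not rows taken from the data set
for value in ["Rd", "Rd/SLEPS", "SLEPS/Rd", "SLERS"]:
    print(value, morpho_e1_rd(value))
```

Only the first classification counts, so a crater whose ejecta is radial in an outer layer but not in the inner-most one is coded as 0.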

The variable Quadrangle was split by SAS into its 30 categories.
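A rough Python sketch of what this split means: each of the 30 categories becomes its own 0/1 column, so the five effects (Intercept, Quadrangle and the three quantitative variables) turn into the 34 parameters reported in the Dimensions table of the SAS output. The helper names and the abbreviated level names (“MC-01” to “MC-30”) are assumptions for the illustration.

```python
QUANTITATIVE = ["DIAM_CIRCLE_IMAGE", "DEPTH_RIMFLOOR_TOPOG", "NUMBER_LAYERS"]


def design_columns(quadrangle_levels):
    """Column names after splitting Quadrangle into one dummy per level."""
    dummies = ["Quadrangle_" + level for level in quadrangle_levels]
    return ["Intercept"] + dummies + QUANTITATIVE


def encode(row, quadrangle_levels):
    """Turn one observation (a dict) into a numeric row matching design_columns."""
    dummies = [1.0 if row["Quadrangle"] == level else 0.0
               for level in quadrangle_levels]
    return [1.0] + dummies + [float(row[name]) for name in QUANTITATIVE]


levels = ["MC-%02d" % i for i in range(1, 31)]  # abbreviated, hypothetical labels
columns = design_columns(levels)
example = encode({"Quadrangle": "MC-18", "DIAM_CIRCLE_IMAGE": 10.0,
                  "DEPTH_RIMFLOOR_TOPOG": 0.5, "NUMBER_LAYERS": 0}, levels)

print(len(columns))        # 1 intercept + 30 dummies + 3 quantitative
print(sum(example[1:31]))  # exactly one quadrangle dummy is switched on
```

Because every dummy is a separate column, the selection can keep some quadrangles and drop others, which is what the selected model in the output does.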

The Results

The number of valid observations was 44625. Using a seed of 6587, the data set was divided into 31238 observations for training and 13387 observations for testing.
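The sizes of the two partitions follow directly from the 0.7 sampling rate. The Python sketch below reproduces the counts only: Python's random generator is not the one SAS uses, so the same seed selects different rows, and rounding the sample size up is an assumption that happens to match the reported 31238.

```python
import math
import random

N_OBS = 44625  # valid observations reported by SAS
RATE = 0.7     # SAMPRATE in PROC SURVEYSELECT
SEED = 6587    # same seed value, but a different generator than SAS


def split_indices(n_obs, rate, seed):
    """Simple random sample without replacement: returns (train, test) index lists."""
    n_train = math.ceil(n_obs * rate)  # assumed rounding rule
    rng = random.Random(seed)
    train = set(rng.sample(range(n_obs), n_train))
    test = [i for i in range(n_obs) if i not in train]
    return sorted(train), test


train_idx, test_idx = split_indices(N_OBS, RATE, SEED)
print(len(train_idx), len(test_idx))
print(round(len(train_idx) / N_OBS, 6))  # selection probability
```

The last line gives 0.700011, the Selection Probability shown in the SURVEYSELECT summary, which is slightly above 0.7 because 44625 × 0.7 is not a whole number.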

Using cross validation, the selected model is the one at step 25, with 26 effects (including the Intercept). A visual and numeric analysis could arguably reduce it to five. Other methods would show similar split numbers and also similar charts supporting a smaller number.

The average squared error (ASE) curves for training and testing are visually superimposed, with no clear indication of the best cut point. A visual inspection also supports a smaller number of variables, for the sake of simplicity.

Using the 25 variables, the model explains about 58% of the variance in MorphoE1_RD (R-Square = 0.5773).

Using SAS

SAS Output

Results: W03-Running a Lasso Regression Analysis.sas

The SURVEYSELECT Procedure

Sample Selection Method

Selection Method Simple Random Sampling

Sample Selection Summary

Input Data Set WORK
Random Number Seed 6587
Sampling Rate 0.7
Sample Size 31238
Selection Probability 0.700011
Sampling Weight 0
Output Data Set TRAINTEST

Mars' Craters - LASSO Regression - Morphology = RD

The GLMSELECT Procedure

Model Information

Data Set WORK.TRAINTEST
Dependent Variable MorphoE1_RD
Selection Method LAR
Stop Criterion None
Choose Criterion Cross Validation
Cross Validation Method Random
Cross Validation Fold 10
Effect Hierarchy Enforced None
Random Number Seed 6587

Number of Observations

Number of Observations Read 44625
Number of Observations Used 44625
Number of Observations Used for Training 31238
Number of Observations Used for Testing 13387

Class Level Information

Class Levels Values
MorphoE1_RD 2 0 1
Quadrangle 30 MC-01: Mare Boreum (North Pole) MC-02: Diacria MC-03: Arcadia MC-04: Mare Acidalium MC-05: Ismenius Lacus MC-06: Casius MC-07: Cebrenia MC-08: Amazonis MC-09: Tharsis MC-10: Lunae Palus MC-11: Oxia Palus MC-12: Arabia MC-13: Syrtis Major …

Dimensions

Number of Effects 5
Number of Effects after Splits 34
Number of Parameters 34

Model Building Summary

LAR Selection Summary

Step Effect Entered Number Effects In ASE Test ASE CV PRESS
(* Optimal Value of Criterion)
0 Intercept 1 0.2386 0.2387 7454.6787
1 NUMBER_LAYERS 2 0.1090 0.1088 3304.3110
2 DIAM_CIRCLE_IMAGE 3 0.1051 0.1049 3199.1954
3 DEPTH_RIMFLOOR_TOPOG 4 0.1022 0.1020 3177.3827
4 Quadrangle_MC-01: Mare Boreum (North Pole) 5 0.1020 0.1019 3174.4301
5 Quadrangle_MC-07: Cebrenia 6 0.1018 0.1017 3170.1046
6 Quadrangle_MC-05: Ismenius Lacus 7 0.1018 0.1017 3166.7408
7 Quadrangle_MC-18: Coprates 8 0.1018 0.1016 3164.3432
8 Quadrangle_MC-21: Iapygia 9 0.1016 0.1015 3162.4652
9 Quadrangle_MC-04: Mare Acidalium 10 0.1015 0.1014 3161.6088
10 Quadrangle_MC-17: Phoenicis Lacus 11 0.1014 0.1013 3160.5560
11 Quadrangle_MC-08: Amazonis 12 0.1014 0.1013 3159.3801
12 Quadrangle_MC-02: Diacria 13 0.1014 0.1013 3158.2723
13 Quadrangle_MC-24: Phaethontis 14 0.1013 0.1012 3157.4473
14 Quadrangle_MC-20: Sinus Sabaeus 15 0.1013 0.1012 3156.7711
15 Quadrangle_MC-12: Arabia 16 0.1012 0.1011 3155.9961
16 Quadrangle_MC-06: Casius 17 0.1012 0.1011 3154.5947
17 Quadrangle_MC-19: Margaritifer Sinus 18 0.1012 0.1011 3154.5210
18 Quadrangle_MC-16: Memnonia 19 0.1011 0.1011 3154.0017
19 Quadrangle_MC-11: Oxia Palus 20 0.1011 0.1010 3153.6842
20 Quadrangle_MC-10: Lunae Palus 21 0.1010 0.1010 3153.7239
21 Quadrangle_MC-09: Tharsis 22 0.1010 0.1009 3153.4905
22 Quadrangle_MC-29: Eridania 23 0.1010 0.1009 3153.3422
23 Quadrangle_MC-03: Arcadia 24 0.1009 0.1009 3153.1830
24 Quadrangle_MC-28: Hellas 25 0.1009 0.1009 3153.0932
25 Quadrangle_MC-14: Amenthes 26 0.1009 0.1009 3153.0284*
26 Quadrangle_MC-30: Mare Australe (South Pol 27 0.1008 0.1009 3153.0699
27 Quadrangle_MC-26: Argyre 28 0.1008 0.1008 3153.2531
28 Quadrangle_MC-22: Mare Tyrrhenum 29 0.1008 0.1008 3153.2018
29 Quadrangle_MC-23: Aeolis 30 0.1008 0.1008 3153.1941
30 Quadrangle_MC-15: Elysium 31 0.1007 0.1008 3153.2564
31 Quadrangle_MC-25: Thaumasia 32 0.1007 0.1008 3153.3774
32 Quadrangle_MC-13: Syrtis Major 33 0.1007 0.1008 3153.5994

Stop Reason

Selection stopped because the change of the maximum absolute correction is tiny.
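How the step is chosen from this table can be sketched in Python: the selected model is simply the step with the smallest CV PRESS. The values below are copied from the LAR Selection Summary; the variable and function names are assumptions for the illustration.

```python
# CV PRESS by selection step (0 to 32), copied from the LAR Selection Summary
CV_PRESS = [
    7454.6787, 3304.3110, 3199.1954, 3177.3827, 3174.4301, 3170.1046,
    3166.7408, 3164.3432, 3162.4652, 3161.6088, 3160.5560, 3159.3801,
    3158.2723, 3157.4473, 3156.7711, 3155.9961, 3154.5947, 3154.5210,
    3154.0017, 3153.6842, 3153.7239, 3153.4905, 3153.3422, 3153.1830,
    3153.0932, 3153.0284, 3153.0699, 3153.2531, 3153.2018, 3153.1941,
    3153.2564, 3153.3774, 3153.5994,
]

# CHOOSE = CV keeps the step with the minimum cross validation PRESS
best_step = min(range(len(CV_PRESS)), key=CV_PRESS.__getitem__)


def relative_gap(step):
    """How much worse (as a fraction) a step is than the optimum."""
    return (CV_PRESS[step] - CV_PRESS[best_step]) / CV_PRESS[best_step]


print(best_step)
for step in (3, 4):
    print(step, round(100 * relative_gap(step), 2), "% above the optimum")
```

The minimum is at step 25, but steps 3 and 4 are already within 1% of it, which is the numeric side of the argument for a much smaller model.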

Coefficient Plot

Panel showing how the standardized coefficients and the CHOOSE= criterion change with the selection step.

Criterion Panel

Panel showing how selection criteria change with the selection step.

ASE Plot

Plot showing how the average squared error for the training, validation, and test data change with the selection step.

Selected Model

The selected model, based on Cross Validation, is the model at Step 25.

Selected Effects

Effects: Intercept Quadrangle_MC-01: Mare Boreum (North Pole) Quadrangle_MC-02: Diacria Quadrangle_MC-03: Arcadia Quadrangle_MC-04: Mare Acidalium Quadrangle_MC-05: Ismenius Lacus Quadrangle_MC-06: Casius Quadrangle_MC-07: Cebrenia Quadrangle_MC-08: Amazonis Quadrangle_MC-09: Tharsis Quadrangle_MC-10: Lunae Palus Quadrangle_MC-11: Oxia Palus Quadrangle_MC-12: Arabia Quadrangle_MC-14: Amenthes Quadrangle_MC-16: Memnonia Quadrangle_MC-17: Phoenicis Lacus Quadrangle_MC-18: Coprates Quadrangle_MC-19: Margaritifer Sinus Quadrangle_MC-20: Sinus Sabaeus Quadrangle_MC-21: Iapygia Quadrangle_MC-24: Phaethontis Quadrangle_MC-28: Hellas Quadrangle_MC-29: Eridania DIAM_CIRCLE_IMAGE DEPTH_RIMFLOOR_TOPOG NUMBER_LAYERS

ANOVA

Analysis of Variance
Source DF Sum of Squares Mean Square F Value
Model 25 4303.36664 172.13467 1705.22
Error 31212 3150.72571 0.10095
Corrected Total 31237 7454.09236

Fit Statistics

Root MSE 0.31772
Dependent Mean 0.60666
R-Square 0.5773
Adj R-Sq 0.5770
AIC -40368
AICC -40368
SBC -71391
ASE (Train) 0.10086
ASE (Test) 0.10087
CV PRESS 3153.02838
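Most of these fit statistics can be recomputed from the Analysis of Variance table, which is a useful check when reading the output. The Python sketch below uses the standard formulas; the variable names are assumptions, and the sums of squares and degrees of freedom are copied from the table.

```python
import math

# Values copied from the Analysis of Variance table (training data)
SS_MODEL = 4303.36664
SS_ERROR = 3150.72571
SS_TOTAL = 7454.09236
DF_MODEL = 25
DF_ERROR = 31212
DF_TOTAL = 31237
N_TRAIN = 31238

r_square = SS_MODEL / SS_TOTAL
adj_r_square = 1 - (SS_ERROR / DF_ERROR) / (SS_TOTAL / DF_TOTAL)
root_mse = math.sqrt(SS_ERROR / DF_ERROR)
ase_train = SS_ERROR / N_TRAIN  # average squared error divides by n, not by DF
f_value = (SS_MODEL / DF_MODEL) / (SS_ERROR / DF_ERROR)

print(round(r_square, 4), round(adj_r_square, 4))
print(round(root_mse, 5), round(ase_train, 5))
print(round(f_value, 1))
```

The results agree with the R-Square, Adj R-Sq, Root MSE and ASE (Train) lines above. The small difference between R-Square and Adj R-Sq reflects how little 25 parameters cost against more than 31000 training observations.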

Parameter Estimates

Parameter DF Estimate
Intercept 1 0.826756
Quadrangle_MC-01: Mare Boreum (North Pole) 1 -0.038410
Quadrangle_MC-02: Diacria 1 0.027574
Quadrangle_MC-03: Arcadia 1 0.005809
Quadrangle_MC-04: Mare Acidalium 1 -0.034143
Quadrangle_MC-05: Ismenius Lacus 1 0.062124
Quadrangle_MC-06: Casius 1 0.024366
Quadrangle_MC-07: Cebrenia 1 0.053741
Quadrangle_MC-08: Amazonis 1 -0.019499
Quadrangle_MC-09: Tharsis 1 -0.007935
Quadrangle_MC-10: Lunae Palus 1 0.022122
Quadrangle_MC-11: Oxia Palus 1 -0.017943
Quadrangle_MC-12: Arabia 1 0.024276
Quadrangle_MC-14: Amenthes 1 0.000678
Quadrangle_MC-16: Memnonia 1 -0.010887
Quadrangle_MC-17: Phoenicis Lacus 1 -0.022572
Quadrangle_MC-18: Coprates 1 -0.060352
Quadrangle_MC-19: Margaritifer Sinus 1 -0.011520
Quadrangle_MC-20: Sinus Sabaeus 1 0.013952
Quadrangle_MC-21: Iapygia 1 0.028199
Quadrangle_MC-24: Phaethontis 1 0.019741
Quadrangle_MC-28: Hellas 1 0.001184
Quadrangle_MC-29: Eridania 1 -0.003254
DIAM_CIRCLE_IMAGE 1 0.002834
DEPTH_RIMFLOOR_TOPOG 1 0.090837
NUMBER_LAYERS 1 -0.512287
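To read these estimates, it helps to plug in a crater. The Python sketch below applies the selected model to a hypothetical crater (the diameter, depth and layer values are made up for the example, and the quadrangle keys are abbreviated to their “MC-nn” codes). Quadrangles that were not selected contribute zero.

```python
INTERCEPT = 0.826756
DIAM_COEF = 0.002834     # DIAM_CIRCLE_IMAGE
DEPTH_COEF = 0.090837    # DEPTH_RIMFLOOR_TOPOG
LAYERS_COEF = -0.512287  # NUMBER_LAYERS

# Estimates for the 22 quadrangles kept in the selected model
QUADRANGLE_COEF = {
    "MC-01": -0.038410, "MC-02": 0.027574, "MC-03": 0.005809, "MC-04": -0.034143,
    "MC-05": 0.062124, "MC-06": 0.024366, "MC-07": 0.053741, "MC-08": -0.019499,
    "MC-09": -0.007935, "MC-10": 0.022122, "MC-11": -0.017943, "MC-12": 0.024276,
    "MC-14": 0.000678, "MC-16": -0.010887, "MC-17": -0.022572, "MC-18": -0.060352,
    "MC-19": -0.011520, "MC-20": 0.013952, "MC-21": 0.028199, "MC-24": 0.019741,
    "MC-28": 0.001184, "MC-29": -0.003254,
}


def predict(quadrangle_code, diameter_km, depth_km, number_layers):
    """Predicted value of MorphoE1_RD (1 = Radial) from the selected model."""
    return (INTERCEPT
            + QUADRANGLE_COEF.get(quadrangle_code, 0.0)  # 0 if not selected
            + DIAM_COEF * diameter_km
            + DEPTH_COEF * depth_km
            + LAYERS_COEF * number_layers)


# Hypothetical crater in MC-18: Coprates, 10 km wide and 0.5 km deep
print(round(predict("MC-18", 10.0, 0.5, 0), 4))  # no cohesive layers
print(round(predict("MC-18", 10.0, 0.5, 1), 4))  # one cohesive layer
```

Adding a single layer lowers the prediction by about 0.51, far more than any quadrangle moves it, which is consistent with NUMBER_LAYERS being the first effect to enter the model. Since this is a linear model fitted to a 0/1 response, the predictions are not bounded to the interval from 0 to 1.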

SAS Code

/* Use Course's Library */
LIBNAME mydata "/courses/d1406ae5ba27fe300" ACCESS = readonly;

DATA WORK;
  SET mydata.marscrater_pds;

  WHERE MORPHOLOGY_EJECTA_1 NE " ";

  /* Collapse the Morphology of Ejecta 1 to its main feature, to reduce the output */
  IF (INDEX(MORPHOLOGY_EJECTA_1, "/") = 0)
    THEN MorphoE1 = MORPHOLOGY_EJECTA_1;
    ELSE MorphoE1 = SUBSTR(MORPHOLOGY_EJECTA_1, 1, INDEX(MORPHOLOGY_EJECTA_1, "/") - 1);
  MorphoE1 = UPCASE(TRIM(MorphoE1));

  /* Is Morphology 1 equal to "RD"? */
  IF MorphoE1 = "RD"
    THEN MorphoE1_RD = 1;
    ELSE MorphoE1_RD = 0;

  /* Declare the length up front: otherwise the first assignment (31 characters)
     fixes it and "MC-30: Mare Australe (South Pole)" is truncated */
  LENGTH Quadrangle $ 33;

  /* Convert coordinates to Quadrangles: https://en.wikipedia.org/wiki/List_of_quadrangles_on_Mars */
  LA = LATITUDE_CIRCLE_IMAGE;
  LO = LONGITUDE_CIRCLE_IMAGE + 180;
  IF LA >=  65 AND LA <=  90 AND LO >=   0 AND LO <= 360 THEN Quadrangle = "MC-01: Mare Boreum (North Pole)";
  IF LA >=  30 AND LA <   65 AND LO >= 120 AND LO <  180 THEN Quadrangle = "MC-02: Diacria";
  IF LA >=  30 AND LA <   65 AND LO >=  60 AND LO <  120 THEN Quadrangle = "MC-03: Arcadia";
  IF LA >=  30 AND LA <   65 AND LO >=   0 AND LO <   60 THEN Quadrangle = "MC-04: Mare Acidalium";
  IF LA >=  30 AND LA <   65 AND LO >= 300 AND LO <= 360 THEN Quadrangle = "MC-05: Ismenius Lacus";
  IF LA >=  30 AND LA <   65 AND LO >= 240 AND LO <  300 THEN Quadrangle = "MC-06: Casius";
  IF LA >=  30 AND LA <   65 AND LO >= 180 AND LO <  240 THEN Quadrangle = "MC-07: Cebrenia";
  IF LA >=   0 AND LA <   30 AND LO >= 135 AND LO <  180 THEN Quadrangle = "MC-08: Amazonis";
  IF LA >=   0 AND LA <   30 AND LO >=  90 AND LO <  135 THEN Quadrangle = "MC-09: Tharsis";
  IF LA >=   0 AND LA <   30 AND LO >=  45 AND LO <   90 THEN Quadrangle = "MC-10: Lunae Palus";
  IF LA >=   0 AND LA <   30 AND LO >=   0 AND LO <   45 THEN Quadrangle = "MC-11: Oxia Palus";
  IF LA >=   0 AND LA <   30 AND LO >= 315 AND LO <= 360 THEN Quadrangle = "MC-12: Arabia";
  IF LA >=   0 AND LA <   30 AND LO >= 270 AND LO <  315 THEN Quadrangle = "MC-13: Syrtis Major";
  IF LA >=   0 AND LA <   30 AND LO >= 225 AND LO <  270 THEN Quadrangle = "MC-14: Amenthes";
  IF LA >=   0 AND LA <   30 AND LO >= 180 AND LO <  225 THEN Quadrangle = "MC-15: Elysium";
  IF LA >= -30 AND LA <    0 AND LO >= 135 AND LO <  180 THEN Quadrangle = "MC-16: Memnonia";
  IF LA >= -30 AND LA <    0 AND LO >=  90 AND LO <  135 THEN Quadrangle = "MC-17: Phoenicis Lacus";
  IF LA >= -30 AND LA <    0 AND LO >=  45 AND LO <   90 THEN Quadrangle = "MC-18: Coprates";
  IF LA >= -30 AND LA <    0 AND LO >=   0 AND LO <   45 THEN Quadrangle = "MC-19: Margaritifer Sinus";
  IF LA >= -30 AND LA <    0 AND LO >= 315 AND LO <= 360 THEN Quadrangle = "MC-20: Sinus Sabaeus";
  IF LA >= -30 AND LA <    0 AND LO >= 270 AND LO <  315 THEN Quadrangle = "MC-21: Iapygia";
  IF LA >= -30 AND LA <    0 AND LO >= 225 AND LO <  270 THEN Quadrangle = "MC-22: Mare Tyrrhenum";
  IF LA >= -30 AND LA <    0 AND LO >= 180 AND LO <  225 THEN Quadrangle = "MC-23: Aeolis";
  IF LA >= -65 AND LA <  -30 AND LO >= 120 AND LO <  180 THEN Quadrangle = "MC-24: Phaethontis";
  IF LA >= -65 AND LA <  -30 AND LO >=  60 AND LO <  120 THEN Quadrangle = "MC-25: Thaumasia";
  IF LA >= -65 AND LA <  -30 AND LO >=   0 AND LO <   60 THEN Quadrangle = "MC-26: Argyre";
  IF LA >= -65 AND LA <  -30 AND LO >= 300 AND LO <= 360 THEN Quadrangle = "MC-27: Noachis";
  IF LA >= -65 AND LA <  -30 AND LO >= 240 AND LO <  300 THEN Quadrangle = "MC-28: Hellas";
  IF LA >= -65 AND LA <  -30 AND LO >= 180 AND LO <  240 THEN Quadrangle = "MC-29: Eridania";
  IF LA >= -90 AND LA <  -65 AND LO >=   0 AND LO <= 360 THEN Quadrangle = "MC-30: Mare Australe (South Pole)";

  LABEL Quadrangle             = "Quadrangle"
        DIAM_CIRCLE_IMAGE      = "Diameter"
        DEPTH_RIMFLOOR_TOPOG   = "Depth"
        MorphoE1_RD            = "Morphology 1-RD"
        NUMBER_LAYERS          = "Layers";

RUN;

ODS GRAPHICS ON;

/* Split the data randomly into training (70%) and test (30%) sets;
   OUTALL keeps every row and flags the sampled ones in the variable Selected */
PROC SURVEYSELECT DATA     = WORK
                  OUT      = TRAINTEST
                  METHOD   = SRS
                  SAMPRATE = 0.7
                  SEED     = 6587
                  OUTALL;
RUN;

/* Multiple regression with LAR selection and 10-fold random cross validation */
PROC GLMSELECT DATA  = TRAINTEST
               PLOTS = ALL
               SEED  = 6587;
  PARTITION ROLE    = SELECTED(TRAIN = '1'
                               TEST  = '0');
  CLASS MorphoE1_RD
        Quadrangle;
  MODEL MorphoE1_RD = Quadrangle
                      DIAM_CIRCLE_IMAGE
                      DEPTH_RIMFLOOR_TOPOG
                      NUMBER_LAYERS
  / SELECTION = LAR(CHOOSE = CV
                    STOP   = NONE)
    CVMETHOD  = RANDOM(10);
  TITLE "Mars' Craters - LASSO Regression - Morphology = RD";
RUN;

TITLE;