Summary

Specialisation: Data Analysis and Interpretation
Course: Regression Modeling in Practice
Education Institution: Wesleyan University
Publisher: Coursera
Assignment: Test a Basic Linear Regression Model

Introduction

Besides the long-standing fascination with Mars, due to its proximity and some similarities with Earth, the growing collection of facts about the planet allows scientists to put together a big jigsaw puzzle.

A recent NASA announcement of evidence of flowing liquid water on the surface of Mars adds to what is already known and to the curiosity, or even the need, to find out much more.

The Data Set

With all the talk about Mars, in both science and fiction (Ridley Scott’s film The Martian, based on Andy Weir’s book of the same name, was released in early October 2015), the data set chosen for this assignment is the Mars Craters data set.

The Mars Craters Study presents a global database of 378,540 Mars craters with a diameter of 1 km or larger, created between 4.2 and 3.8 billion years ago during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).

The data set was made available by Wesleyan University/Coursera as part of the Data Analysis and Interpretation Specialisation, from the Ph.D. Thesis Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database (2011) by Robbins, S.J., University of Colorado at Boulder.

Topic of Interest

The data set provides a catalogue of craters on Mars. The initial idea is to check for patterns that could identify specific major events that might have had a significant impact on Mars’ geology, climate and life as a planetary body.

Codebook

As the initial data set has only nine variables, and all of them could be relevant to formulating hypotheses and reaching a conclusion, all the variables will be kept for this assignment.

Variables

  • CRATER_ID: crater ID for internal use, based upon the region of the planet (\({1 \over 16}\)), the “pass” under which the crater was identified, and the order in which it was identified
  • LATITUDE_CIRCLE_IMAGE: latitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees North)
  • LONGITUDE_CIRCLE_IMAGE: longitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees East)
  • DIAM_CIRCLE_IMAGE: diameter from a non-linear least squares circle fit to the vertices selected to manually identify the crater rim (units are km)
  • DEPTH_RIMFLOOR_TOPOG: average elevation of each of the manually determined N points along (or inside) the crater rim (units are km)
    • Depth Rim: Points are selected as relative topographic highs under the assumption they are the least eroded so most original points along the rim
    • Depth Floor: Points were chosen as the lowest elevation that did not include visible embedded craters
  • MORPHOLOGY_EJECTA_1: ejecta morphology classified.
    • If there are multiple values, separated by a “/”, then the order is the inner-most ejecta through the outer-most, or the top-most through the bottom-most
  • MORPHOLOGY_EJECTA_2: the morphology of the layer(s) itself/themselves. This classification system is unique to this work.
  • MORPHOLOGY_EJECTA_3: overall texture and/or shape of some of the layer(s)/ejecta that are generally unique and deserve separate morphological classification.
  • NUMBER_LAYERS: the maximum number of cohesive layers in any azimuthal direction that could be reliably identified

Test a Basic Linear Regression Model

Variables

In the original full database:

  • Diameter (DIAM_CIRCLE_IMAGE) is a quantitative variable that ranges from 1.00 to 1,164.22 km. It will be used as the response (dependent) variable in this test.
  • Depth (DEPTH_RIMFLOOR_TOPOG) is a quantitative variable that ranges from -0.42 to 4.95 km. It will be used as the explanatory variable in this test.

Sample

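A simple random sample of about ten percent of the data set, 38,435 cases, will be used for the regression. Note that the sampling weight of 9.999818 reported by SAS implies an input file of about 384,343 cases, slightly more than the 378,540 craters quoted in the study description.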
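As a rough illustration of the sampling step, the sketch below draws a simple random sample without replacement at a ten percent rate. It uses Python's standard library rather than SAS, so it will not reproduce the SAS selection even with the same seed. The helper name `simple_random_sample`, the toy population of row indices, and the rounding-up of the sample size are assumptions made for this example.

```python
import math
import random


def simple_random_sample(rows, rate, seed):
    """Draw a simple random sample without replacement (hypothetical helper)."""
    # Assumption: the sample size is the sampling rate times the
    # population size, rounded up to a whole number of cases.
    size = math.ceil(rate * len(rows))
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(rows, size)


# Toy population of 1,005 row indices standing in for the crater table.
population = list(range(1005))
sample = simple_random_sample(population, rate=0.1, seed=196587)
print(len(sample))
```

Calling the helper again with the same seed returns the same cases, which is what makes a seeded sample reproducible between runs.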

Outliers

Two cases in the sample, both with a diameter of 500 km or more, could be considered outliers on the numbers alone, although the data are correct. Removing them does not change the regression model considerably: the slope moves from 22.76 to 22.83, although R-square rises from 0.24 to 0.38.
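To put a number on how little the two cases move the model, the following Python sketch computes the relative change in each coefficient. The coefficient values are copied from the two regression equations reported in the Regression section; the helper name `relative_change` is made up for this example.

```python
# Coefficients copied from the two regression equations in this write-up.
with_outliers = {"intercept": 3.61806635, "slope": 22.76468952}
without_outliers = {"intercept": 3.57347150, "slope": 22.83346222}


def relative_change(before, after):
    """Relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


changes = {
    name: relative_change(with_outliers[name], without_outliers[name])
    for name in ("intercept", "slope")
}
for name, change in changes.items():
    print(f"{name}: {change:+.2%}")
```

Under these figures the intercept drops by a little over one percent and the slope rises by roughly a third of a percent.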

Regression

In both scenarios, with and without the outliers, the p-value is well below the significance level of five percent: \(\alpha = 0.05\) and \(p{-}value < 0.0001\).

According to the regression model fitted to the sample data, every extra km of Depth corresponds to an increase of 22.76 km in Diameter. Depth is centred on its sample mean in the SAS code, so the intercept equals the mean Diameter.

This indicates a significant and positive association between Diameter and Depth.

\(Diameter = 3.61806635 + 22.76468952 * Depth\) (Sample Data)

\(Diameter = 3.57347150 + 22.83346222 * Depth\) (Outliers removed)
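The sample-data equation can be turned into a small prediction helper, sketched here in Python. It assumes the input is a raw rim-to-floor depth in km, which must first be centred on the sample mean used in the SAS code (0.0756716534408713). The function name `predict_diameter` is hypothetical.

```python
# Coefficients reported for the sample data in the SAS output.
INTERCEPT = 3.61806635  # km; the mean diameter, because Depth is centred
SLOPE = 22.76468952  # km of diameter per km of depth
MEAN_DEPTH = 0.0756716534408713  # km; sample mean subtracted in the SAS code


def predict_diameter(raw_depth_km):
    """Predicted diameter (km) for a raw rim-to-floor depth (km)."""
    centred_depth = raw_depth_km - MEAN_DEPTH
    return INTERCEPT + SLOPE * centred_depth


# A crater of average depth is predicted to have the average diameter.
print(round(predict_diameter(MEAN_DEPTH), 2))
# A crater that is 1 km deep, by the same equation.
print(round(predict_diameter(1.0), 2))
```

For a crater that is 1 km deep the sketch gives a predicted diameter of about 24.66 km. With an R-square of roughly 0.24 on the sample data, such point predictions carry a lot of uncertainty.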

Using SAS

SAS Output

Results: W02-Test a Basic Linear Regression Model-Local.sas

The SURVEYSELECT Procedure

Sample Selection Method

Selection Method Simple Random Sampling

Sample Selection Summary

Input Data Set WORK
Random Number Seed 196587
Sampling Rate 0.1
Sample Size 38435
Selection Probability 0.100002
Sampling Weight 9.999818
Output Data Set EXTRACT

Mars' Craters - Summary - Sample Data

The SUMMARY Procedure

Summary statistics

Variable              Label     N      Mean          Std Dev     Minimum     Maximum
DEPTH_RIMFLOOR_TOPOG  Depth     38435  0.0756717     0.2217400   -0.0100000  3.6000000
Depth                           38435  -3.46074E-16  0.2217400   -0.0856717  3.5243283
DIAM_CIRCLE_IMAGE     Diameter  38435  3.6180663     10.3805219  1.0000000   1096.65

The SGPLOT Procedure

[Plot: Mars' Craters - Diameter by Depth - Sample Data; regression of Diameter (y-axis) on Depth (x-axis)]

Mars' Craters - GLM - Sample Data

The GLM Procedure

Number of Observations Read 38435
Number of Observations Used 38435

Dependent Variable: DIAM_CIRCLE_IMAGE Diameter

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 1 979325.302 979325.302 11902.8 <.0001
Error 38433 3162139.359 82.277    
Corrected Total 38434 4141464.661      

Fit Statistics

R-Square Coeff Var Root MSE DIAM_CIRCLE_IMAGE Mean
0.236468 250.7043 9.070649 3.618066

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
Depth 1 979325.3017 979325.3017 11902.8 <.0001

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
Depth 1 979325.3017 979325.3017 11902.8 <.0001

Solution

Parameter Estimate Standard Error t Value Pr > |t|
Intercept 3.61806635 0.04626738 78.20 <.0001
Depth 22.76468952 0.20865875 109.10 <.0001
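The fit statistics in this output follow from the ANOVA table by simple arithmetic. The Python sketch below recomputes them as a sanity check, using only the sums of squares, degrees of freedom, slope and standard error printed above. The variable names are chosen for this example and are not SAS names.

```python
import math

# Values copied from the Overall ANOVA and Solution tables (sample data).
ss_model = 979325.302
ss_error = 3162139.359
df_model = 1
df_error = 38433
slope = 22.76468952
slope_se = 0.20865875

ss_total = ss_model + ss_error  # Corrected Total
r_square = ss_model / ss_total  # share of variance explained
mean_square_error = ss_error / df_error
f_value = (ss_model / df_model) / mean_square_error
root_mse = math.sqrt(mean_square_error)
t_value = slope / slope_se  # t statistic for the Depth coefficient

print(f"R-Square {r_square:.6f}")
print(f"F Value  {f_value:.1f}")
print(f"Root MSE {root_mse:.6f}")
print(f"t Value  {t_value:.2f}")
```

With a single explanatory variable the squared t statistic for Depth matches the F value up to rounding, which is why the Type I and Type III tables repeat the overall F value.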

Mars' Craters - Summary - Outliers removed

The SUMMARY Procedure

Summary statistics

Variable              Label     N      Mean          Std Dev    Minimum     Maximum
DEPTH_RIMFLOOR_TOPOG  Depth     38433  0.0756756     0.2217451  -0.0100000  3.6000000
Depth                           38433  -6.21191E-17  0.2217451  -0.0856756  3.5243244
DIAM_CIRCLE_IMAGE     Diameter  38433  3.5734715     8.1634387  1.0000000   326.7700000

The SGPLOT Procedure

[Plot: Mars' Craters - Diameter by Depth - Outliers removed; regression of Diameter (y-axis) on Depth (x-axis)]

Mars' Craters - GLM - Outliers removed

The GLM Procedure

Number of Observations Read 38433
Number of Observations Used 38433

Dependent Variable: DIAM_CIRCLE_IMAGE Diameter

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 1 985245.401 985245.401 24026.4 <.0001
Error 38431 1575929.644 41.007    
Corrected Total 38432 2561175.045      

Fit Statistics

R-Square Coeff Var Root MSE DIAM_CIRCLE_IMAGE Mean
0.384685 179.1997 6.403650 3.573471

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
Depth 1 985245.4011 985245.4011 24026.4 <.0001

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
Depth 1 985245.4011 985245.4011 24026.4 <.0001

Solution

Parameter Estimate Standard Error t Value Pr > |t|
Intercept 3.57347150 0.03266446 109.40 <.0001
Depth 22.83346222 0.14730827 155.00 <.0001

SAS Code

/* Using SAS Educational Virtual Machine running locally */
/* For CSV Files uploaded from MacOS */
FILENAME CSV     "/folders/myfolders/marscrater_pds.csv" 
         TERMSTR = CRLF;

PROC IMPORT DATAFILE = CSV
            OUT      = WORK
            DBMS     = CSV
            REPLACE;
RUN;

/* Unassign the file reference. */
FILENAME CSV;

DATA WORK;
  SET   WORK;

  LABEL DEPTH_RIMFLOOR_TOPOG = "Depth"     /* Explanatory - Quantitative */
        DIAM_CIRCLE_IMAGE    = "Diameter"; /* Response    - Quantitative */
RUN;

/* Select a sample of 10% of the population */
PROC SURVEYSELECT DATA     = WORK 
                  OUT      = EXTRACT 
                  METHOD   = SRS 
                  SAMPRATE = 0.1 
                  SEED     = 196587;
  ID DIAM_CIRCLE_IMAGE 
     DEPTH_RIMFLOOR_TOPOG;
RUN;

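/* Centre the explanatory variable on its sample mean (0.0756717, as     */
/* reported by PROC SUMMARY) so that the intercept is the mean Diameter. */
DATA SAMPLE ;
  SET   EXTRACT;
  Depth = DEPTH_RIMFLOOR_TOPOG - 0.0756716534408713;
RUN;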

PROC SUMMARY DATA = SAMPLE PRINT ;
  VAR   DEPTH_RIMFLOOR_TOPOG
        Depth
        DIAM_CIRCLE_IMAGE;
  TITLE "Mars' Craters - Summary - Sample Data";
RUN;

PROC SGPLOT DATA = SAMPLE ;
  REG   X = Depth
        Y = DIAM_CIRCLE_IMAGE
  /     FILLEDOUTLINEDMARKERS
        MARKERATTRS        = (COLOR = GREY SIZE = 5 SYMBOL = CIRCLEFILLED)
        MARKERFILLATTRS    = (COLOR = GREY)
        MARKEROUTLINEATTRS = (COLOR = GREY THICKNESS = 1)
        LINEATTRS          = (COLOR = RED THICKNESS = 2);
  TITLE "Mars' Craters - Diameter by Depth - Sample Data";
  YAXIS LABEL = "Diameter";
  XAXIS LABEL = "Depth";
RUN;

PROC GLM DATA = SAMPLE ;
  MODEL DIAM_CIRCLE_IMAGE = Depth;
  TITLE "Mars' Craters - GLM - Sample Data";
RUN;

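/* Drop the two outliers (Diameter of 500 km or more) and re-centre Depth */
/* on the mean of the remaining cases (0.0756756).                        */
DATA NO_OUTLIERS ;
  SET   SAMPLE;

  WHERE DIAM_CIRCLE_IMAGE    < 500 AND
        DEPTH_RIMFLOOR_TOPOG <   4 ;

  Depth = DEPTH_RIMFLOOR_TOPOG - 0.075675591288738414;
RUN;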

PROC SUMMARY DATA = NO_OUTLIERS PRINT ;
  VAR   DEPTH_RIMFLOOR_TOPOG 
        Depth 
        DIAM_CIRCLE_IMAGE;
  TITLE "Mars' Craters - Summary - Outliers removed";
RUN;

PROC SGPLOT DATA = NO_OUTLIERS;
  REG   X = Depth  
        Y = DIAM_CIRCLE_IMAGE
  /     FILLEDOUTLINEDMARKERS
        MARKERATTRS        = (COLOR = GREY SIZE = 5 SYMBOL = CIRCLEFILLED)
        MARKERFILLATTRS    = (COLOR = GREY)
        MARKEROUTLINEATTRS = (COLOR = GREY THICKNESS = 1)
        LINEATTRS          = (COLOR = RED THICKNESS = 2);
  TITLE "Mars' Craters - Diameter by Depth - Outliers removed";
  YAXIS LABEL = "Diameter";
  XAXIS LABEL = "Depth";
RUN;

PROC GLM DATA = NO_OUTLIERS;
  MODEL DIAM_CIRCLE_IMAGE = Depth;
  TITLE "Mars' Craters - GLM - Outliers removed";
RUN;
TITLE;