Summary

Specialisation Data Analysis and Interpretation
Course Regression Modeling in Practice
Education Institution Wesleyan University
Publisher Coursera
Assignment Test a Multiple Regression Model

Introduction

Besides the historical fascination with Mars due to its proximity and its similarities to Earth, the collection of facts about the planet allows scientists to put together a big jigsaw puzzle.

NASA’s recent announcement of evidence of flowing liquid water on the surface of Mars adds to what is already known and to the curiosity, or even the need, to find out much more.

Summary

Although statistically significant, the regression models tested explain at most about 58% of the variability in crater Diameter from Depth and Layers. Neither combining both variables nor adding a quadratic term improves significantly on the base Diameter by Depth association, which explains about 56%.

The Data Set

With all the talk about Mars in both science and fiction (Ridley Scott’s film The Martian, based on Andy Weir’s book of the same name, was released in early October 2015), the data set chosen for this assignment is the Mars Craters data set.

The Mars Craters Study presents a global database of 378,540 Mars craters with a diameter of 1 km or larger that were created between 4.2 and 3.8 billion years ago, during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).

The data set was made available by Wesleyan University/Coursera as part of the Data Analysis and Interpretation Specialisation. It comes from the Ph.D. thesis “Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database” (2011) by Robbins, S.J., University of Colorado at Boulder.

Topic of Interest

The data set provides a catalogue of craters on Mars. The initial idea is to check for patterns that could identify specific major events that might have had a significant impact on Mars’ geology, climate and life as a planetary body.

Codebook

As the initial data set has only nine variables, all of which could be relevant to formulating hypotheses and reaching a conclusion, all the variables will be kept for this assignment.

Variables

  • CRATER_ID: crater ID for internal use, based upon the region of the planet (\({1 \over 16}\)), the “pass” under which the crater was identified, and the order in which it was identified
  • LATITUDE_CIRCLE_IMAGE: latitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees North)
  • LONGITUDE_CIRCLE_IMAGE: longitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees East)
  • DIAM_CIRCLE_IMAGE: diameter from a non-linear least squares circle fit to the vertices selected to manually identify the crater rim (units are km)
  • DEPTH_RIMFLOOR_TOPOG: average elevation of each of the manually determined N points along (or inside) the crater rim (units are km)
    • Depth Rim: Points are selected as relative topographic highs under the assumption that they are the least eroded, and so the most original, points along the rim
    • Depth Floor: Points were chosen as the lowest elevation that did not include visible embedded craters
  • MORPHOLOGY_EJECTA_1: ejecta morphology classified.
    • If there are multiple values, separated by a “/”, then the order is the inner-most ejecta through the outer-most, or the top-most through the bottom-most
  • MORPHOLOGY_EJECTA_2: the morphology of the layer(s) itself/themselves. This classification system is unique to this work.
  • MORPHOLOGY_EJECTA_3: overall texture and/or shape of some of the layer(s)/ejecta that are generally unique and deserve separate morphological classification.
  • NUMBER_LAYERS: the maximum number of cohesive layers in any azimuthal direction that could be reliably identified

Test a Multiple Regression Model

Variables

In the original full database:

  • Diameter (DIAM_CIRCLE_IMAGE) is a quantitative variable that ranges from 1.00 to 1,164.22. It will be used as the dependent variable in this test.
  • Depth (DEPTH_RIMFLOOR_TOPOG) is a quantitative variable that ranges from -0.42 to 4.95. It will be used as an explanatory variable in this test. As zero is a valid value in the range, it is not necessary to centre it on the mean.
  • Layers (NUMBER_LAYERS) is a quantitative discrete variable that ranges from 0 to 5. It will be used as an explanatory variable in this test. As zero is a valid value in the range, it is not necessary to centre it on the mean.

Sample

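A simple random sample of about one percent of the original data set is used for the regression. PROC SURVEYSELECT drew 3,844 cases at a sampling rate of 0.01, and 3,843 of them remain after craters with a diameter of 200 km or more are excluded as outliers.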
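
The sampling step can also be sketched outside SAS. The following is a minimal Python illustration of simple random sampling without replacement at a fixed rate. The function name `simple_random_sample`, the rounding-up rule and the toy population are assumptions made for this example, not part of the original analysis, and a Python seed will not reproduce the rows that SAS selected.

```python
import math
import random


def simple_random_sample(population, rate, seed):
    """Draw a simple random sample without replacement.

    The sample size is rate * len(population), rounded up (an assumption
    of this sketch). The product is rounded to 9 decimals first so that
    floating-point noise does not push the size up by one.
    """
    size = math.ceil(round(len(population) * rate, 9))
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, size)


# Hypothetical population of 1,000 crater row numbers, sampled at 1%.
rows = list(range(1000))
sample = simple_random_sample(rows, rate=0.01, seed=196587)
print(len(sample))  # 10
```

Every row has the same chance of being selected and no row can be selected twice, which is what the simple random sampling method in the analysis relies on.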

Regression

Single-variable regressions were fitted first, using Depth and then Layers to explain Diameter.

Depth

Depth explains about 55.6% of the variability in Diameter, with \(R{-}Square = 0.555726\). The p-value is well below the significance level of five percent: \(\alpha = 0.05\) and \(p{-}value < 0.0001\).

\(Diameter = 1.75 + 21.4 * Depth\)

According to the regression model, each extra km of Depth is associated with an increase of 21.4 km in Diameter in the sample data.

This indicates a significant and positive association between Diameter and Depth.
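
The R-Square and F values quoted for this model follow directly from the sums of squares in the ANOVA table of the SAS output. The snippet below is a small Python check of that arithmetic; the function names are chosen for this example only, and the numbers are copied from the Diameter by Depth output.

```python
def r_squared(ss_model, ss_total):
    """Share of the total sum of squares accounted for by the model."""
    return ss_model / ss_total


def f_statistic(ss_model, df_model, ss_error, df_error):
    """Ratio of the model mean square to the error mean square."""
    return (ss_model / df_model) / (ss_error / df_error)


# Sums of squares and degrees of freedom as reported in the SAS output
# for the Diameter by Depth model.
SS_MODEL, DF_MODEL = 93715.9556, 1
SS_ERROR, DF_ERROR = 74921.1526, 3841
SS_TOTAL = 168637.1082

r2 = r_squared(SS_MODEL, SS_TOTAL)
f_value = f_statistic(SS_MODEL, DF_MODEL, SS_ERROR, DF_ERROR)
print(f"R-Square = {r2:.4f}, F = {f_value:.2f}")
```

Both results agree with the R-Square (0.555726) and F Value (4804.56) reported by PROC GLM to the printed precision.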

Layers

Layers explains about 3.5% of the variability in Diameter, with \(R{-}Square = 0.035153\). The p-value is well below the significance level of five percent: \(\alpha = 0.05\) and \(p{-}value < 0.0001\).

\(Diameter = 3.13 + 3.99 * Layers\)

According to the regression model, each extra layer is associated with an increase of 3.99 km in Diameter in the sample data.

This indicates a significant and positive association between Diameter and Layers.

Depth and Layers

When combined in a linear regression, Depth and Layers together explain about 57.9% of the variability in Diameter, with \(R{-}Square = 0.578792\). The p-value is well below the significance level of five percent: \(\alpha = 0.05\) and \(p{-}value < 0.0001\) for both variables.

\(Diameter = 1.84 + 23.5 * Depth - 3.59 * Layers\)

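This indicates a significant and positive association between Diameter and Depth, and a significant and negative association between Diameter and Layers once Depth is held constant. The Layers coefficient changes sign relative to the single-variable model (from 3.99 to -3.59), which suggests that Depth confounds the association between Layers and Diameter.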
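
How much Layers adds once Depth is already in the model can be read from the Type I sums of squares in the SAS output. The Python sketch below works through that split; the helper name `share_of_total` is an assumption of this example, and the figures are copied from the Diameter by Depth and Layers output.

```python
# Figures reported in the SAS output for the sample data.
SS_TOTAL = 168637.1082              # corrected total sum of squares
SS_DEPTH = 93715.95561              # Type I SS for Depth (entered first)
SS_LAYERS_GIVEN_DEPTH = 3889.77049  # Type I SS for Layers (entered second)


def share_of_total(ss, ss_total=SS_TOTAL):
    """Fraction of the total variation attributed to one source."""
    return ss / ss_total


r2_depth = share_of_total(SS_DEPTH)
r2_gain = share_of_total(SS_LAYERS_GIVEN_DEPTH)
r2_both = r2_depth + r2_gain

print(f"Depth alone:      {r2_depth:.4f}")
print(f"Gain from Layers: {r2_gain:.4f}")
print(f"Depth and Layers: {r2_both:.4f}")
```

Adding Layers raises R-Square by roughly 2.3 percentage points, from about 0.5557 to about 0.5788, which is why the combined model is only a modest improvement on Depth alone.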

Depth and Depth Squared

Using Depth with linear and quadratic terms explains about 56.4% of the variability in Diameter, with \(R{-}Square = 0.564372\). The p-value is well below the significance level of five percent: \(\alpha = 0.05\) and \(p{-}value < 0.0001\) for both terms.

\(Diameter = 1.89 + 16.57 * Depth + 3.91 * Depth^2\)

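This indicates a significant and positive association between Diameter and Depth, with a positive quadratic term, although the gain in R-Square over the linear model is under one percentage point.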
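
To see how differently the linear and quadratic fits behave, the two fitted equations can be evaluated side by side. The Python sketch below uses the rounded coefficients quoted in the text; the function names and the depths chosen are illustrative assumptions, apart from 2.41 km, which is the maximum Depth in the sample summary.

```python
def linear_model(depth):
    """Fitted straight line: Diameter = 1.75 + 21.4 * Depth."""
    return 1.75 + 21.4 * depth


def quadratic_model(depth):
    """Fitted curve: Diameter = 1.89 + 16.57 * Depth + 3.91 * Depth^2."""
    return 1.89 + 16.57 * depth + 3.91 * depth ** 2


# Illustrative depths in km; 2.41 km is the deepest crater in the sample.
for depth in (0.0, 0.5, 1.0, 2.41):
    lin = linear_model(depth)
    quad = quadratic_model(depth)
    print(f"Depth {depth:4.2f} km: linear {lin:6.2f} km, quadratic {quad:6.2f} km")
```

With these rounded coefficients the two predictions stay within about 1.5 km of each other up to a depth of 1 km and separate only for the deepest craters, which is in line with the small difference in R-Square between the two models.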

Charts

The charts agree with the numbers: they show a dispersion of points that, although the fit is statistically significant, does not conform adequately to a linear or a quadratic regression.

Residuals

The residuals are not randomly distributed and do not follow a normal distribution well, which can also be seen on the Q-Q plot.
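
For readers unfamiliar with the Q-Q plot, the idea can be sketched in a few lines of Python: each sorted residual is paired with the value a normal distribution would place at the same rank. The residuals below are hypothetical, the function name is chosen for this example, and the plotting positions \((i - 0.5)/n\) are one common convention that may differ from the one SAS uses.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(residuals):
    """Pair each sorted residual with the normal quantile of the same rank.

    Uses the plotting positions (i - 0.5) / n, one common convention.
    """
    n = len(residuals)
    fitted_normal = NormalDist(mean(residuals), stdev(residuals))
    ordered = sorted(residuals)
    return [
        (fitted_normal.inv_cdf((i - 0.5) / n), observed)
        for i, observed in enumerate(ordered, start=1)
    ]


# Hypothetical residuals with one large positive value (right skew).
toy_residuals = [-1.2, -0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.4, 0.9, 6.0]
points = normal_qq_points(toy_residuals)
for expected, observed in points:
    print(f"expected {expected:6.2f}   observed {observed:6.2f}")
```

When residuals are close to normal, the expected and observed values are nearly equal. In this toy example the largest residual sits well above its expected quantile, the kind of departure from the reference line that the text describes.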

Leverage and Outliers

A considerable number of points have high leverage or are outliers, and would change the regression line or curve if removed.
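
Leverage in a one-predictor regression depends only on how far each explanatory value lies from the mean. The Python sketch below computes it for a hypothetical set of depths. The data, the function name and the \(2p/n\) cut-off are assumptions made for illustration and are not taken from the SAS output.

```python
def leverages(x_values):
    """Leverage of each point in a one-predictor regression with intercept.

    h_i = 1/n + (x_i - mean)^2 / sum((x_j - mean)^2)
    """
    n = len(x_values)
    x_mean = sum(x_values) / n
    sxx = sum((x - x_mean) ** 2 for x in x_values)
    return [1 / n + (x - x_mean) ** 2 / sxx for x in x_values]


# Hypothetical depths in km: mostly shallow craters plus one deep one.
toy_depths = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 2.4]
h = leverages(toy_depths)

# Rule of thumb (an assumption of this sketch): flag h > 2 * p / n,
# where p = 2 parameters (intercept and slope).
threshold = 2 * 2 / len(toy_depths)
flagged = [x for x, h_i in zip(toy_depths, h) if h_i > threshold]
print(flagged)  # [2.4]
```

A single deep crater among many shallow ones dominates the leverage, so it has a disproportionate influence on the slope. The sample here has a similar shape, with a mean Depth of about 0.08 km and a maximum of 2.41 km.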

Using SAS

SAS Output

Results: W03-Test a Basic Linear Regression Model-Local.sas

The SURVEYSELECT Procedure

Sample Selection Method

Selection Method Simple Random Sampling

Sample Selection Summary

Input Data Set WORK
Random Number Seed 196587
Sampling Rate 0.01
Sample Size 3844
Selection Probability 0.010001
Sampling Weight 99.98517
Output Data Set EXTRACT

Mars' Craters - Summary - Sample Data

The SUMMARY Procedure

Summary statistics

Variable Label N Mean Std Dev Minimum Maximum
Diameter Diameter 3843 3.4079079 6.6251832 1.0000000 104.6900000
Depth Depth 3843 0.0774707 0.2307015 -0.0300000 2.4100000
Depth2 Depth Squared 3843 0.0592110 0.3257892 0 5.8081000
Layers Layers 3843 0.0694770 0.3113545 0 3.0000000
Layers2 Layers Squared 3843 0.1017434 0.6000213 0 9.0000000

The SGPLOT Procedure

Mars' Craters - GLM: Diameter by Depth - Sample Data

The GLM Procedure

Data

Number of Observations

Number of Observations Read 3843
Number of Observations Used 3843

Dependent Variable: Diameter Diameter

Analysis of Variance

Diameter

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 1 93715.9556 93715.9556 4804.56 <.0001
Error 3841 74921.1526 19.5056    
Corrected Total 3842 168637.1082      

Fit Statistics

R-Square Coeff Var Root MSE Diameter Mean
0.555726 129.5962 4.416519 3.407908

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
Depth 1 93715.95561 93715.95561 4804.56 <.0001

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
Depth 1 93715.95561 93715.95561 4804.56 <.0001

Solution

Parameter Estimate Standard Error t Value Pr > |t| 95% Confidence Limits
Intercept 1.74940837 0.07515404 23.28 <.0001 1.60206272 1.89675401
Depth 21.40808023 0.30885243 69.31 <.0001 20.80254978 22.01361069

Diagnostics Panel

Panel of Fit Diagnostics for Diameter, which displays scatter plots of residuals, absolute residuals, studentized residuals, and observed responses by predicted values, studentized residuals by leverage, Cook's D by observation, a Q-Q plot of residuals, a residual histogram, and a residual-fit spread plot.

Fit Plot

Fit Plot for Diameter by Depth

Mars' Craters - GLM: Diameter by Layers - Sample Data

The GLM Procedure

Data

Number of Observations

Number of Observations Read 3843
Number of Observations Used 3843

Dependent Variable: Diameter Diameter

Analysis of Variance

Diameter

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 1 5928.1150 5928.1150 139.94 <.0001
Error 3841 162708.9932 42.3611    
Corrected Total 3842 168637.1082      

Fit Statistics

R-Square Coeff Var Root MSE Diameter Mean
0.035153 190.9835 6.508541 3.407908

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
Layers 1 5928.115008 5928.115008 139.94 <.0001

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
Layers 1 5928.115008 5928.115008 139.94 <.0001

Solution

Parameter Estimate Standard Error t Value Pr > |t| 95% Confidence Limits
Intercept 3.130725643 0.10757294 29.10 <.0001 2.919820100 3.341631185
Layers 3.989555635 0.33724836 11.83 <.0001 3.328352649 4.650758621

Mars' Craters - GLM: Diameter by Depth and Layers - Sample Data

The GLM Procedure

Data

Number of Observations

Number of Observations Read 3843
Number of Observations Used 3843

Dependent Variable: Diameter Diameter

Analysis of Variance

Diameter

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 97605.7261 48802.8631 2638.31 <.0001
Error 3840 71031.3821 18.4978    
Corrected Total 3842 168637.1082      

Fit Statistics

R-Square Coeff Var Root MSE Diameter Mean
0.578792 126.2036 4.300902 3.407908

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
Depth 1 93715.95561 93715.95561 5066.34 <.0001
Layers 1 3889.77049 3889.77049 210.28 <.0001

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
Depth 1 91677.61109 91677.61109 4956.15 <.0001
Layers 1 3889.77049 3889.77049 210.28 <.0001

Solution

Parameter Estimate Standard Error t Value Pr > |t| 95% Confidence Limits
Intercept 1.83554439 0.07342729 25.00 <.0001 1.69158417 1.97950461
Depth 23.51485901 0.33401830 70.40 <.0001 22.85998877 24.16972925
Layers -3.58895476 0.24749448 -14.50 <.0001 -4.07418797 -3.10372155

Mars' Craters - GLM: Diameter by Depth and Depth Squared - Sample Data

The GLM Procedure

Data

Number of Observations

Number of Observations Read 3843
Number of Observations Used 3843

Dependent Variable: Diameter Diameter

Analysis of Variance

Diameter

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 95174.0328 47587.0164 2487.43 <.0001
Error 3840 73463.0754 19.1310    
Corrected Total 3842 168637.1082      

Fit Statistics

R-Square Coeff Var Root MSE Diameter Mean
0.564372 128.3456 4.373901 3.407908

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
Depth 1 93715.95561 93715.95561 4898.64 <.0001
Depth2 1 1458.07719 1458.07719 76.22 <.0001

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
Depth 1 13117.05568 13117.05568 685.64 <.0001
Depth2 1 1458.07719 1458.07719 76.22 <.0001

Solution

Parameter Estimate Standard Error t Value Pr > |t| 95% Confidence Limits
Intercept 1.89246140 0.07621126 24.83 <.0001 1.74304298 2.04187981
Depth 16.57129011 0.63285917 26.18 <.0001 15.33051783 17.81206238
Depth2 3.91238933 0.44814726 8.73 <.0001 3.03375989 4.79101877

Mars' Craters - Regression - Sample Data

The REG Procedure

Model: MODEL1

Dependent Variable: Diameter Diameter

Fit

Diameter

Number of Observations

Number of Observations Read 3843
Number of Observations Used 3843

Analysis of Variance

Source DF Sum of Squares Mean Square F Value Pr > F
Model 1 93716 93716 4804.56 <.0001
Error 3841 74921 19.50564    
Corrected Total 3842 168637      

Fit Statistics

Root MSE 4.41652 R-Square 0.5557
Dependent Mean 3.40791 Adj R-Sq 0.5556
Coeff Var 129.59619    

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 1.74941 0.07515 23.28 <.0001
Depth Depth 1 21.40808 0.30885 69.31 <.0001

Observation-wise Statistics

Diameter

Diagnostic Plots

Fit Diagnostics

Panel of fit diagnostics for Diameter.

Residual Plots

Depth

Scatter plot of residuals by Depth for Diameter.

Fit Plot

Scatterplot of Diameter by Depth overlaid with the fit line, a 95% confidence band and lower and upper 95% prediction limits.

Mars' Craters - Regression - Sample Data

The REG Procedure

Model: MODEL2

Dependent Variable: Diameter Diameter

Fit

Diameter

Number of Observations

Number of Observations Read 3843
Number of Observations Used 3843

Analysis of Variance

Source DF Sum of Squares Mean Square F Value Pr > F
Model 3 98321 32774 1789.31 <.0001
Error 3839 70316 18.31631    
Corrected Total 3842 168637      

Fit Statistics

Root MSE 4.27976 R-Square 0.5830
Dependent Mean 3.40791 Adj R-Sq 0.5827
Coeff Var 125.58308    

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 1.93041 0.07463 25.87 <.0001
Depth Depth 1 19.88768 0.66893 29.73 <.0001
Depth2 Depth Squared 1 2.79146 0.44676 6.25 <.0001
Layers Layers 1 -3.28885 0.25092 -13.11 <.0001

Observation-wise Statistics

Diameter

Diagnostic Plots

Fit Diagnostics

Panel of fit diagnostics for Diameter.

Residual Plots

Panel 1

Panel of scatterplots of residuals by regressors for Diameter.

Partial Regression Residual Plot

Partial Residual Plots

Diameter

Panel 1

Panel of partial regression scatterplots by regressors for Diameter.

SAS Code

/* Using SAS Educational Virtual Machine running locally */
/* For CSV Files uploaded from MacOS */
FILENAME CSV     "/folders/myfolders/marscrater_pds.csv"
         TERMSTR = CRLF;

PROC IMPORT DATAFILE = CSV
            OUT      = WORK
            DBMS     = CSV
            REPLACE;
RUN;

/* Unassign the file reference. */
FILENAME CSV;

/* Select a simple random sample of 1% of the population */
PROC SURVEYSELECT DATA     = WORK
                  OUT      = EXTRACT
                  METHOD   = SRS
                  SAMPRATE = 0.01
                  SEED     = 196587;
  /* Keep only the variables needed for the regression */
  ID DIAM_CIRCLE_IMAGE
     DEPTH_RIMFLOOR_TOPOG
     NUMBER_LAYERS;
RUN;

DATA SAMPLE;
  SET EXTRACT;
  /* remove outliers */
  WHERE DIAM_CIRCLE_IMAGE < 200;
  /* No need to Centre the Explanatory variables as they
     Already include a meaningful Zero */

  Diameter = DIAM_CIRCLE_IMAGE;
  Depth    = DEPTH_RIMFLOOR_TOPOG;
  Layers   = NUMBER_LAYERS;

  /* Create squared versions to check for quadratic regression */
  Depth2  = Depth  * Depth;
  Layers2 = Layers * Layers;

  LABEL Diameter = "Diameter"        /* Response    - Quantitative */
        Depth    = "Depth"           /* Explanatory - Quantitative */
        Layers   = "Layers"          /* Explanatory - Quantitative */
        Depth2   = "Depth Squared"
        Layers2  = "Layers Squared";
RUN;

/* Descriptive statistics for the sample */
PROC SUMMARY DATA = SAMPLE PRINT ;
  VAR   Diameter
        Depth
        Depth2
        Layers
        Layers2;
  TITLE "Mars' Craters - Summary - Sample Data";
RUN;

/* Scatter plot with linear (red) and quadratic (green) fits */
PROC SGPLOT DATA = SAMPLE ;
  REG   X = Depth
        Y = Diameter
  /     FILLEDOUTLINEDMARKERS
        MARKERATTRS        = (COLOR = GREY SIZE = 5 SYMBOL = CIRCLEFILLED)
        MARKERFILLATTRS    = (COLOR = GREY)
        MARKEROUTLINEATTRS = (COLOR = GREY THICKNESS = 1)
        LINEATTRS          = (COLOR = RED THICKNESS = 2) DEGREE = 1 CLM;
  REG   X = Depth
        Y = Diameter
  /     LINEATTRS          = (COLOR = GREEN THICKNESS = 2) DEGREE = 2 CLM;
  TITLE "Mars' Craters - Diameter by Depth - Sample Data";
  YAXIS LABEL = "Diameter";
  XAXIS LABEL = "Depth";
RUN;

/* Single-variable model: Diameter by Depth, with diagnostics */
PROC GLM
  DATA  = SAMPLE
  PLOTS = DIAGNOSTICS;
  MODEL Diameter = Depth / CLPARM;
  OUTPUT OUT = RESULTS RESIDUAL = Residual STUDENT = Student;
  TITLE "Mars' Craters - GLM: Diameter by Depth - Sample Data";
RUN;

/* Single-variable model: Diameter by Layers */
PROC GLM
  DATA  = SAMPLE
  PLOTS = NONE;
  MODEL Diameter = Layers / CLPARM;
  TITLE "Mars' Craters - GLM: Diameter by Layers - Sample Data";
RUN;

/* Multiple regression: Diameter by Depth and Layers */
PROC GLM
  DATA  = SAMPLE
  PLOTS = NONE;
  MODEL Diameter = Depth Layers / CLPARM;
  TITLE "Mars' Craters - GLM: Diameter by Depth and Layers - Sample Data";
RUN;

/* Quadratic model: Diameter by Depth and Depth Squared */
PROC GLM
  DATA  = SAMPLE
  PLOTS = NONE;
  MODEL Diameter = Depth Depth2 / CLPARM;
  TITLE "Mars' Craters - GLM: Diameter by Depth and Depth Squared - Sample Data";
RUN;

/* MODEL1: Depth only; MODEL2: Depth, Depth Squared and Layers */
PROC REG
  DATA  = SAMPLE
  PLOTS = PARTIAL;
  MODEL Diameter = Depth;
  MODEL Diameter = Depth Depth2 Layers / PARTIAL;
  TITLE "Mars' Craters - Regression - Sample Data";
RUN;
/* End the interactive REG procedure */
QUIT;

/* Clear the title */
TITLE ;