Summary

Specialisation: Data Analysis and Interpretation
Course: Data Analysis Tools
Education Institution: Wesleyan University
Publisher: Coursera
Assignment: Generating a Correlation Coefficient

Introduction

Besides the historical fascination with Mars, owing to its proximity and its similarities with Earth, the facts collected so far allow scientists to piece together a large jigsaw puzzle.

A recent NASA announcement of evidence of flowing liquid water on the surface of Mars adds to what is already known and to the curiosity, or even the need, to find out much more.

The Data Set

With all the talk about Mars, in both science and fiction (Ridley Scott’s film The Martian, based on Andy Weir’s book of the same name, opens in early October 2015), the data set chosen for this assignment is the Mars Craters data set.

The Mars Craters Study presents a global database that includes over 300,000 Mars craters 1 km or larger that were created between 4.2 and 3.8 billion years ago during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).

The data set was made available by Wesleyan University/Coursera as part of the Data Management and Visualisation course of the Data Analysis and Interpretation Specialisation. It comes from the Ph.D. thesis Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database (2011) by Robbins, S.J., University of Colorado at Boulder.

Topic of Interest

The data set provides a catalogue of craters on Mars. The initial idea is to look for patterns that could identify specific major events that may have had a significant impact on Mars’ geology, climate and life as a planetary body.

Codebook

As the initial data set has only nine variables, all of them could be relevant to formulating hypotheses and reaching a conclusion, so all the variables will be kept for this assignment.

Variables

  • CRATER_ID: crater ID for internal use, based upon the region of the planet (\({1 \over 16}\)), the “pass” under which the crater was identified, and the order in which it was identified
  • LATITUDE_CIRCLE_IMAGE: latitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees North)
  • LONGITUDE_CIRCLE_IMAGE: longitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees East)
  • DIAM_CIRCLE_IMAGE: diameter from a non-linear least squares circle fit to the vertices selected to manually identify the crater rim (units are km)
  • DEPTH_RIMFLOOR_TOPOG: average elevation of each of the manually determined N points along (or inside) the crater rim (units are km)
    • Depth Rim: Points are selected as relative topographic highs under the assumption they are the least eroded so most original points along the rim
    • Depth Floor: Points were chosen as the lowest elevation that did not include visible embedded craters
  • MORPHOLOGY_EJECTA_1: ejecta morphology classified.
    • If there are multiple values, separated by a “/”, then the order is the inner-most ejecta through the outer-most, or the top-most through the bottom-most
  • MORPHOLOGY_EJECTA_2: the morphology of the layer(s) itself/themselves. This classification system is unique to this work.
  • MORPHOLOGY_EJECTA_3: overall texture and/or shape of some of the layer(s)/ejecta that are generally unique and deserve separate morphological classification.
  • NUMBER_LAYERS: the maximum number of cohesive layers in any azimuthal direction that could be reliably identified

Correlation

Is the Diameter of a crater associated with its Depth?

Analysis using all the points

The correlation analysis using Pearson’s coefficient on the full dataset indicates a statistically significant result (\({p{-}value} < 0.0001\)). Pearson’s coefficient indicates a positive, moderate linear correlation (\(r = 0.5867\)). Diameter accounts for about 34% of the variability in Depth (\(r^2 = 0.3442\)).
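
As a quick check on these figures, the coefficient of determination is the square of Pearson's \(r\). The test statistic shown afterwards assumes the usual \(t\) test of \(H_0: \rho = 0\) with \(n - 2\) degrees of freedom. It is not printed in the SAS output, so its value is only a rough illustration:

\[ r^2 = 0.58671^2 \approx 0.3442 \]

\[ t = r \sqrt{\frac{n - 2}{1 - r^2}} = 0.58671 \sqrt{\frac{384341}{0.6558}} \approx 449 \]

A statistic of this size is consistent with the p-value falling below the smallest value SAS prints. With this many observations, even a much weaker correlation would be statistically significant, so the size of \(r\) matters more here than the p-value.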

Analysis eliminating outlier points

To make the plot easier to read, a second analysis excluded craters with a Diameter of 400 km or more or a Depth of 4 km or more. This removed 13 observations out of 384,343 (about 0.0034%).
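
The share of observations removed can be checked from the counts reported in the SAS output (384,343 before filtering and 384,330 after):

\[ 384343 - 384330 = 13, \qquad \frac{13}{384343} \approx 3.38 \times 10^{-5} \approx 0.00338\% \]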

The correlation analysis using Pearson’s coefficient on the reduced dataset indicates a statistically significant result (\({p{-}value} < 0.0001\)). Pearson’s coefficient indicates a positive, moderate linear correlation (\(r = 0.6301\)). Diameter accounts for about 40% of the variability in Depth (\(r^2 = 0.3970\)).
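
Squaring the coefficient for the reduced dataset and comparing it with the value for the full dataset gives:

\[ r^2 = 0.63006^2 \approx 0.3970 \]

\[ 0.3970 - 0.3442 = 0.0528 \]

Removing the 13 extreme craters therefore raises the share of explained variability by roughly five percentage points.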

Using SAS

SAS Code

/* Using SAS Educational Virtual Machine running locally */
/* TERMSTR=CRLF reads Windows-style (CRLF) line endings; use TERMSTR=LF for Unix/MacOS-style files */
FILENAME CSV "/folders/myfolders/marscrater_pds.csv" TERMSTR=CRLF;

PROC IMPORT DATAFILE=CSV OUT=WORK DBMS=CSV REPLACE;
RUN;

/* Unassign the file reference. */
FILENAME CSV;

DATA WORK;
  SET   WORK;
  LABEL DIAM_CIRCLE_IMAGE    = "Diameter" 
        DEPTH_RIMFLOOR_TOPOG = "Depth";
RUN;

/* Order the data by the number of layers (the grouping variable used in the plots) */
PROC SORT ;
  BY NUMBER_LAYERS;
RUN;

PROC SGPLOT ;
  TITLE   "Mars' Craters - Depth by Diameter";
  SCATTER X     = DIAM_CIRCLE_IMAGE 
          Y     = DEPTH_RIMFLOOR_TOPOG
  /       GROUP = NUMBER_LAYERS;
RUN;
    
PROC CORR ;
  VAR DIAM_CIRCLE_IMAGE 
      DEPTH_RIMFLOOR_TOPOG;
RUN;

DATA NO_OUTLIERS ;
  SET   WORK;
  WHERE DIAM_CIRCLE_IMAGE    < 400 AND
        DEPTH_RIMFLOOR_TOPOG <   4 ;
RUN;

PROC SGPLOT DATA = NO_OUTLIERS ;
  TITLE   "Mars' Craters - Depth by Diameter - Outliers removed";
  SCATTER X     = DIAM_CIRCLE_IMAGE 
          Y     = DEPTH_RIMFLOOR_TOPOG
  /       GROUP = NUMBER_LAYERS;
RUN;
    
PROC CORR DATA = NO_OUTLIERS ;
  VAR DIAM_CIRCLE_IMAGE 
      DEPTH_RIMFLOOR_TOPOG;
RUN;

SAS Output

Results: A03-02-SAS-Correlation-Local.sas

The SGPLOT Procedure

[Scatter plot: Mars' Craters - Depth by Diameter; X axis: Diameter, Y axis: Depth, grouped by NUMBER_LAYERS]

The CORR Procedure

Variables Information

2 Variables: DIAM_CIRCLE_IMAGE DEPTH_RIMFLOOR_TOPOG

Simple Statistics

Variable              N       Mean     Std Dev  Sum      Minimum   Maximum  Label
DIAM_CIRCLE_IMAGE     384343  3.55669  8.59199  1366988  1.00000   1164     Diameter
DEPTH_RIMFLOOR_TOPOG  384343  0.07584  0.22152  29148    -0.42000  4.95000  Depth

Pearson Correlations

Pearson Correlation Coefficients, N = 384343
Prob > |r| under H0: Rho=0

                                 DIAM_CIRCLE_IMAGE  DEPTH_RIMFLOOR_TOPOG
DIAM_CIRCLE_IMAGE (Diameter)     1.00000            0.58671
                                                    <.0001
DEPTH_RIMFLOOR_TOPOG (Depth)     0.58671            1.00000
                                 <.0001

The SGPLOT Procedure

[Scatter plot: Mars' Craters - Depth by Diameter - Outliers removed; X axis: Diameter, Y axis: Depth, grouped by NUMBER_LAYERS]

The CORR Procedure

Variables Information

2 Variables: DIAM_CIRCLE_IMAGE DEPTH_RIMFLOOR_TOPOG

Simple Statistics

Variable              N       Mean     Std Dev  Sum      Minimum   Maximum    Label
DIAM_CIRCLE_IMAGE     384330  3.54149  7.94177  1361102  1.00000   376.35000  Diameter
DEPTH_RIMFLOOR_TOPOG  384330  0.07577  0.22099  29122    -0.42000  3.80000    Depth

Pearson Correlations

Pearson Correlation Coefficients, N = 384330
Prob > |r| under H0: Rho=0

                                 DIAM_CIRCLE_IMAGE  DEPTH_RIMFLOOR_TOPOG
DIAM_CIRCLE_IMAGE (Diameter)     1.00000            0.63006
                                                    <.0001
DEPTH_RIMFLOOR_TOPOG (Depth)     0.63006            1.00000
                                 <.0001