Summary

Specialisation: Data Analysis and Interpretation
Course: Data Analysis Tools
Education Institution: Wesleyan University
Publisher: Coursera
Assignment: Run an analysis of variance

Introduction

Beyond the long-standing fascination with Mars, owing to its proximity and its similarities to Earth, each new collection of facts helps scientists put together a big jigsaw puzzle.

NASA’s recent announcement of evidence of flowing liquid water on the surface of Mars adds to what is known and to the curiosity, or even the need, to find out much more.

The Data Set

With all the talk about Mars in both science and fiction (Ridley Scott’s film The Martian, based on Andy Weir’s book of the same name, opens in early October 2015), the data set chosen for this assignment is the Mars Craters data set.

The Mars Craters Study presents a global database of over 300,000 Mars craters 1 km or larger that were created between 4.2 and 3.8 billion years ago, during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).

The data set was made available by Wesleyan University/Coursera as part of the Data Management and Visualisation course of the Data Analysis and Interpretation Specialisation. It comes from the Ph.D. thesis Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database (2011) by Robbins, S.J., University of Colorado at Boulder.

Topic of Interest

The data set provides a catalogue of craters on Mars. The initial idea is to check for patterns that could identify specific major events that may have had a significant impact on Mars’ geology, climate and life as a planetary body.

Codebook

As the initial data set has only nine variables, they could all be relevant to formulating hypotheses and reaching a conclusion, so all the variables will be kept for this assignment.

Variables

  • CRATER_ID: crater ID for internal use, based upon the region of the planet (\({1 \over 16}\)), the “pass” under which the crater was identified, and the order in which it was identified
  • LATITUDE_CIRCLE_IMAGE: latitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees North)
  • LONGITUDE_CIRCLE_IMAGE: longitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees East)
  • DIAM_CIRCLE_IMAGE: diameter from a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are km)
  • DEPTH_RIMFLOOR_TOPOG: average elevation of each of the manually determined N points along (or inside) the crater rim (units are km)
    • Depth Rim: Points were selected as relative topographic highs, under the assumption that they are the least eroded and therefore the most original points along the rim
    • Depth Floor: Points were chosen as the lowest elevation that did not include visible embedded craters
  • MORPHOLOGY_EJECTA_1: ejecta morphology classified.
    • If there are multiple values, separated by a “/”, then the order is the inner-most ejecta through the outer-most, or the top-most through the bottom-most
  • MORPHOLOGY_EJECTA_2: the morphology of the layer(s) itself/themselves. This classification system is unique to this work.
  • MORPHOLOGY_EJECTA_3: overall texture and/or shape of some of the layer(s)/ejecta that are generally unique and deserve separate morphological classification.
  • NUMBER_LAYERS: the maximum number of cohesive layers in any azimuthal direction that could be reliably identified
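One way to check the imported columns against this codebook is to list them in file order. This is a minimal sketch, assuming the CSV has already been imported into a data set named WORK, as is done in the SAS Code section; it was not part of the original assignment run.

```sas
/* Sketch only: assumes the data set WORK created by PROC IMPORT. */
/* VARNUM lists the variables in column order rather than         */
/* alphabetically, which makes it easier to spot a column that    */
/* SAS imported under a generic name such as VAR10.               */
PROC CONTENTS DATA=WORK VARNUM;
RUN;
```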

Hypothesis

Are the diameter and depth of bigger craters associated with the Hemisphere where they are located?

Bigger Craters

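For this analysis, a Large Diameter crater is one more than 50 km wide, and a Big Depth crater is one more than 1 km deep. The two thresholds are applied independently, so the diameter and depth analyses use different subsets of craters.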

Formulation for Diameter association with Hemisphere

The Null Hypothesis (\(H_0\)) will consider that there is no association between the diameter of big craters and the hemisphere where they are located.

The Alternative Hypothesis (\(H_1\)) will consider that there is an association between the diameter of big craters and the hemisphere where they are located.

  • \(H_0 : \mu_{diameter_{north}} = \mu_{diameter_{south}}\), Diameter is NOT associated with Hemisphere
  • \(H_1 : \mu_{diameter_{north}} \neq \mu_{diameter_{south}}\), Diameter is associated with Hemisphere

Model Interpretation for ANOVA

The Analysis of Variance (ANOVA) results, when examining the association between Diameter of big craters (quantitative response) and Hemisphere location (categorical explanatory), showed no statistically significant difference: the mean diameter of big craters is about the same in both hemispheres. In terms of the hypothesis proposition, the conclusion is to fail to reject the Null Hypothesis, given that the p-value was quite high (\(p\text{-value} = 0.8399\), reported by SAS as Pr > F, with an F value of \(0.04\)). The statistics for diameter in the North hemisphere are \(\bar{x} = 79.4222378\) km and \(s = 43.0653131\) km for a total of \(572\) observations. Comparatively, the diameter in the South hemisphere has \(\bar{x} = 79.9679706\) km and \(s = 58.8595601\) km for a total of \(1498\) observations.
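As a worked check, the F value can be recovered from the sums of squares and degrees of freedom in the Overall ANOVA table of the SAS Output section. The symbols \(SS\), \(df\) and \(MS\) are just shorthand for the table columns Sum of Squares, DF and Mean Square.

\[
F = \frac{MS_{model}}{MS_{error}} = \frac{SS_{model} / df_{model}}{SS_{error} / df_{error}} = \frac{123.281 / 1}{6245267.081 / 2068} = \frac{123.281}{3019.955} \approx 0.04
\]

\[
R^2 = \frac{SS_{model}}{SS_{total}} = \frac{123.281}{6245390.362} \approx 0.00002
\]

In other words, Hemisphere accounts for roughly 0.002% of the variance in the diameter of big craters, which is consistent with the high p-value.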

Model Interpretation for post hoc ANOVA results

As Hemisphere has only two categories (North and South), a post hoc analysis is not necessary: the F test already covers the only possible pairwise comparison.
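For contrast, a post hoc test becomes relevant once the explanatory variable has more than two levels. The following is a minimal sketch, not something that was run for this assignment. It assumes the WORK data set built in the SAS Code section and uses N_LAYERS purely as an illustrative class variable, on the assumption that it takes more than two distinct values. Duncan's multiple range test is one choice among several.

```sas
/* Sketch only: not part of the original assignment run.            */
/* Assumes the WORK data set built in the SAS Code section and that */
/* N_LAYERS has more than two distinct values.                      */
PROC ANOVA DATA=WORK;
  CLASS N_LAYERS;
  MODEL LARGE_DIAMETER = N_LAYERS;
  MEANS N_LAYERS / DUNCAN;   /* Duncan's multiple range test on the group means */
RUN;
QUIT;
```

The post hoc comparisons are only worth reading when the overall F test for the model is itself significant.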

Formulation for Depth association with Hemisphere

The Null Hypothesis (\(H_0\)) will consider that there is no association between the depth of big craters and the hemisphere where they are located.

The Alternative Hypothesis (\(H_1\)) will consider that there is an association between the depth of big craters and the hemisphere where they are located.

  • \(H_0 : \mu_{depth_{north}} = \mu_{depth_{south}}\), Depth is NOT associated with Hemisphere
  • \(H_1 : \mu_{depth_{north}} \neq \mu_{depth_{south}}\), Depth is associated with Hemisphere

Model Interpretation for ANOVA

The Analysis of Variance (ANOVA) results, when examining the association between Depth of big craters (quantitative response) and Hemisphere location (categorical explanatory), showed no statistically significant difference: the mean depth of deep craters is about the same in both hemispheres. In terms of the hypothesis proposition, the conclusion is to fail to reject the Null Hypothesis, given that the p-value was quite high (\(p\text{-value} = 0.3360\), reported by SAS as Pr > F, with an F value of \(0.93\)). The statistics for depth in the North hemisphere are \(\bar{x} = 1.37797973\) km and \(s = 0.35730327\) km for a total of \(1480\) observations. Comparatively, the depth in the South hemisphere has \(\bar{x} = 1.38903846\) km and \(s = 0.37248995\) km for a total of \(3328\) observations.
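The F test can also be rebuilt from the group summaries quoted in this section, which is a useful sanity check on the reported table. This is a minimal sketch under stated assumptions: the variable names inside the DATA step are hypothetical, the inputs are the rounded values printed in this report, and PROBF is used for the cumulative F distribution.

```sas
/* Sketch only: rebuilds the one-way ANOVA F test for depth from  */
/* the group sizes, means and standard deviations reported above. */
DATA _NULL_;
  n_north = 1480;  mean_north = 1.37797973;  sd_north = 0.35730327;
  n_south = 3328;  mean_south = 1.38903846;  sd_south = 0.37248995;

  n_total    = n_north + n_south;
  grand_mean = (n_north*mean_north + n_south*mean_south) / n_total;

  /* Between-group (model) and within-group (error) sums of squares */
  ss_model = n_north*(mean_north - grand_mean)**2
           + n_south*(mean_south - grand_mean)**2;
  ss_error = (n_north - 1)*sd_north**2 + (n_south - 1)*sd_south**2;

  df_model = 1;              /* two hemispheres minus one */
  df_error = n_total - 2;    /* observations minus number of groups */

  f_value = (ss_model/df_model) / (ss_error/df_error);
  p_value = 1 - PROBF(f_value, df_model, df_error);   /* upper tail of F */

  PUT f_value= 8.4 p_value= 8.4;   /* written to the SAS log */
RUN;
```

The values written to the log should land close to the reported F Value of 0.93 and Pr > F of 0.3360, with small differences possible because the inputs are rounded.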

Model Interpretation for post hoc ANOVA results

As Hemisphere has only two categories (North and South), a post hoc analysis is not necessary: the F test already covers the only possible pairwise comparison.
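With two groups, the one-way ANOVA F test is equivalent to a pooled two-sample t test (\(F = t^2\)), which is another way to see why no post hoc step is needed. The following is a minimal sketch of that cross-check, assuming the WORK data set built in the SAS Code section; it was not part of the original assignment run.

```sas
/* Sketch only: two-sample t test as a cross-check of the ANOVA. */
/* Assumes the WORK data set built in the SAS Code section.      */
PROC TTEST DATA=WORK;
  CLASS Hemisphere;    /* exactly two levels: North and South */
  VAR BIG_DEPTH;
RUN;
```

The Pooled row of the t test output should show the same p-value as Pr > F in the ANOVA table, while the Satterthwaite row drops the equal-variance assumption.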

Using SAS

SAS Code

/* Using SAS Educational Virtual Machine running locally */

/* For CSV Files uploaded from Unix/MacOS */
FILENAME CSV "/folders/myfolders/marscrater_pds.csv" TERMSTR=LF;
PROC IMPORT DATAFILE=CSV
            OUT=WORK
            DBMS=CSV
            REPLACE;
RUN;

DATA WORK;                   /* Configure the Data */
  SET WORK;                  /* Data set */
  N_LAYERS = VAR10;          /* Fixing variable identification by SAS */
  IF LATITUDE_CIRCLE_IMAGE < 0
    THEN Hemisphere = "South";
    ELSE Hemisphere = "North";
  IF DIAM_CIRCLE_IMAGE > 50
    THEN LARGE_DIAMETER = DIAM_CIRCLE_IMAGE;
    ELSE LARGE_DIAMETER = .;
  IF DEPTH_RIMFLOOR_TOPOG > 1 
    THEN BIG_DEPTH = DEPTH_RIMFLOOR_TOPOG;
    ELSE BIG_DEPTH = .;
  LABEL Hemisphere     = "Hemisphere"
        N_LAYERS       = "Layers"
        LARGE_DIAMETER = "Diameter"
        BIG_DEPTH      = "Depth";
RUN;

/* Order the data by crater ID */
PROC SORT;
  BY CRATER_ID;
RUN;

/* Show some basic statistics */
/* for Diameter and Depth: overall, then by Hemisphere */
PROC MEANS;
  VARIABLES LARGE_DIAMETER BIG_DEPTH;
RUN;

PROC MEANS;
  CLASS     Hemisphere;
  VARIABLES LARGE_DIAMETER BIG_DEPTH;
RUN;

/* Calculate ANOVA */
/* One-way ANOVA for Diameter on Hemisphere */
/* MEANS only prints the group means; no post hoc test is requested */
PROC ANOVA;
  CLASS Hemisphere;
  MODEL LARGE_DIAMETER = Hemisphere;
  MEANS Hemisphere;
RUN;

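/* One-way ANOVA for Depth on Hemisphere */
/* MEANS only prints the group means; no post hoc test is requested */
PROC ANOVA;
  CLASS Hemisphere;
  MODEL BIG_DEPTH = Hemisphere;
  MEANS Hemisphere;
RUN;
QUIT;                        /* End the interactive ANOVA procedure */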

/* Unassign the file reference.  */
FILENAME CSV;

SAS Output

The MEANS Procedure

Summary statistics

Variable Label N Mean Std Dev Minimum Maximum
LARGE_DIAMETER Diameter 2070 79.8171691 54.9413781 50.0100000 1164.22
BIG_DEPTH Depth 4808 1.3856344 0.3678803 1.0100000 4.9500000

The MEANS Procedure

Summary statistics

Hemisphere N Obs Variable Label N Mean Std Dev Minimum Maximum
North 150894 LARGE_DIAMETER Diameter 572 79.4222378 43.0653131 50.0300000 408.2300000
North 150894 BIG_DEPTH Depth 1480 1.3779797 0.3573033 1.0100000 3.1400000
South 233449 LARGE_DIAMETER Diameter 1498 79.9679706 58.8595601 50.0100000 1164.22
South 233449 BIG_DEPTH Depth 3328 1.3890385 0.3724899 1.0100000 4.9500000

The ANOVA Procedure

Data

Class Levels

Class Level Information
Class Levels Values
Hemisphere 2 North South

Number of Observations

Number of Observations Read 384343
Number of Observations Used 2070

Analysis of Variance

LARGE_DIAMETER

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 1 123.281 123.281 0.04 0.8399
Error 2068 6245267.081 3019.955    
Corrected Total 2069 6245390.362      

Fit Statistics

R-Square Coeff Var Root MSE LARGE_DIAMETER Mean
0.000020 68.85000 54.95412 79.81717

Anova Model ANOVA

Source DF Anova SS Mean Square F Value Pr > F
Hemisphere 1 123.2814444 123.2814444 0.04 0.8399

Box Plot

Distribution of LARGE_DIAMETER by Hemisphere


Means

Level of Hemisphere, N, LARGE_DIAMETER Mean, LARGE_DIAMETER Std Dev
North 572 79.4222378 43.0653131
South 1498 79.9679706 58.8595601

The ANOVA Procedure

Data

Class Levels

Class Level Information
Class Levels Values
Hemisphere 2 North South

Number of Observations

Number of Observations Read 384343
Number of Observations Used 4808

Analysis of Variance

BIG_DEPTH

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 1 0.1252827 0.1252827 0.93 0.3360
Error 4806 650.4345825 0.1353380    
Corrected Total 4807 650.5598652      

Fit Statistics

R-Square Coeff Var Root MSE BIG_DEPTH Mean
0.000193 26.54980 0.367883 1.385634

Anova Model ANOVA

Source DF Anova SS Mean Square F Value Pr > F
Hemisphere 1 0.12528274 0.12528274 0.93 0.3360

Box Plot

Distribution of BIG_DEPTH by Hemisphere


Means

Level of Hemisphere, N, BIG_DEPTH Mean, BIG_DEPTH Std Dev
North 1480 1.37797973 0.35730327
South 3328 1.38903846 0.37248995