Summary

Specialisation: Data Analysis and Interpretation
Course: Data Analysis Tools
Education Institution: Wesleyan University
Publisher: Coursera
Assignment: Running a Chi-Square Test of Independence

Introduction

Besides the long-standing fascination with Mars, due to its proximity and its similarities with Earth, the facts collected about the planet allow scientists to piece together a big jigsaw puzzle.

NASA’s recent announcement of evidence of flowing liquid water on the surface of Mars adds to what is already known and to the curiosity, or even the need, to find out much more.

The Data Set

With all the talk about Mars in both science and fiction (Ridley Scott’s film The Martian, based on the book of the same name by Andy Weir, opens in early October 2015), the data set chosen for this assignment is Mars Craters.

The Mars Craters Study presents a global database of over 300,000 Mars craters 1 km or larger that were created between 4.2 and 3.8 billion years ago during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).

The data set was made available by Wesleyan University/Coursera as part of the Data Management and Visualisation course of the Data Analysis and Interpretation Specialisation. It comes from the Ph.D. thesis Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database (2011) by Robbins, S.J., University of Colorado at Boulder.

Topic of Interest

The data set provides a catalogue of craters on Mars. The initial idea is to look for patterns that could point to specific major events that had a significant impact on Mars’ geology, climate and life as a planetary body.

Codebook

As the initial data set has only nine variables, all of them could be relevant to formulating hypotheses and reaching a conclusion, so all the variables will be kept for this assignment.

Variables

  • CRATER_ID: crater ID for internal use, based upon the region of the planet (\({1 \over 16}\)), the “pass” under which the crater was identified, and the order in which it was identified
  • LATITUDE_CIRCLE_IMAGE: latitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees North)
  • LONGITUDE_CIRCLE_IMAGE: longitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees East)
  • DIAM_CIRCLE_IMAGE: diameter from a non-linear least squares circle fit to the vertices selected to manually identify the crater rim (units are km)
  • DEPTH_RIMFLOOR_TOPOG: average elevation of each of the manually determined N points along (or inside) the crater rim (units are km)
    • Depth Rim: Points are selected as relative topographic highs, under the assumption that they are the least eroded and therefore the most original points along the rim
    • Depth Floor: Points were chosen as the lowest elevation that did not include visible embedded craters
  • MORPHOLOGY_EJECTA_1: ejecta morphology classified.
    • If there are multiple values, separated by a “/”, then the order is the inner-most ejecta through the outer-most, or the top-most through the bottom-most
  • MORPHOLOGY_EJECTA_2: the morphology of the layer(s) itself/themselves. This classification system is unique to this work.
  • MORPHOLOGY_EJECTA_3: overall texture and/or shape of some of the layer(s)/ejecta that are generally unique and deserve separate morphological classification.
  • NUMBER_LAYERS: the maximum number of cohesive layers in any azimuthal direction that could be reliably identified

Hypothesis

Is the number of layers of a crater associated with the hemisphere where the crater is located?

Number of Layers

Because the crater counts drop sharply as the number of layers increases, the variable NUMBER_LAYERS will be reclassified as the variable Layers, grouping craters with two or more layers into a single category, 2+.
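The same regrouping can also be expressed as a user-defined format, which leaves NUMBER_LAYERS itself untouched. The sketch below is only an illustration: it assumes the imported data set is named WORK and already contains a Hemisphere variable, and the format name LAYERGRP is a hypothetical name that does not appear in the assignment's code.

```sas
/* Sketch: group NUMBER_LAYERS into 0, 1 and 2+ with a format.        */
/* LAYERGRP is an illustrative name, not part of the original code.   */
PROC FORMAT;
    VALUE LAYERGRP
        0        = "0"
        1        = "1"
        2 - HIGH = "2+";
RUN;

/* PROC FREQ groups observations by their formatted values, so the    */
/* table has three Layers columns without creating a new variable.    */
PROC FREQ DATA=WORK;
    FORMAT NUMBER_LAYERS LAYERGRP.;
    TABLES Hemisphere * NUMBER_LAYERS / CHISQ;
RUN;
```

Creating a separate Layers variable, as this assignment does, keeps the grouping visible in the data set; the format approach trades that for less code.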

Formulation for Layers association with Hemisphere

The Null Hypothesis (\(H_0\)) will consider that there is no association between number of layers of the craters and the hemisphere where they are located.

The Alternative Hypothesis (\(H_1\)) will consider that there is an association between number of layers of the craters and the hemisphere where they are located.

  • \(H_0 : P(layers = k \mid north) = P(layers = k \mid south)\) for every category \(k\), Layers are not associated with Hemisphere
  • \(H_1 : P(layers = k \mid north) \neq P(layers = k \mid south)\) for at least one category \(k\), Layers are associated with Hemisphere

Model Interpretation for Chi-Square Tests

A chi-square test of independence between the number of layers of craters (categorical explanatory) and hemisphere (categorical response) shows that the variables are associated: the share of craters with 0, 1 or 2+ layers differs between the hemispheres (North: 93.59%, 4.75% and 1.66%; South: 95.69%, 3.55% and 0.76%).

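The association is supported by the resulting statistics: a chi-square of \(\chi^2 = 1037.7882\), \(df = 2\) degrees of freedom and \({p{-}value} < .0001\). With a sample of 384,343 craters the association is statistically significant, although its size is small (Cramer's V = 0.0520).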

The degrees of freedom (df) are the number of rows - 1 multiplied by the number of columns - 1. In this case Hemisphere has two levels and Layers has three, so \(df = (2 - 1) \times (3 - 1) = 2\).
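For reference, these quantities follow from the standard definitions for an \(r \times c\) contingency table. The symbols \(O_{ij}\) (observed count), \(R_i\) (row total), \(C_j\) (column total) and \(n\) (sample size) are introduced here for illustration and are not part of the SAS output.

\[E_{ij} = {R_i C_j \over n}, \qquad \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} {(O_{ij} - E_{ij})^2 \over E_{ij}}, \qquad df = (r - 1)(c - 1)\]

As a worked example using the counts reported in the SAS output, the expected number of craters in the North with 0 layers under independence is

\[E_{North,0} = {150894 \times 364612 \over 384343} \approx 143147.6\]

against an observed count of 141226. The effect size reported by SAS can be checked the same way:

\[V = \sqrt{{\chi^2 \over n \times \min(r - 1, c - 1)}} = \sqrt{{1037.7882 \over 384343 \times 1}} \approx 0.0520\]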

Model Interpretation for post hoc Chi-Square Test results

The null hypothesis was rejected due to the small p-value, but because the explanatory variable has more than two categories the overall test does not show which pairs of categories differ from each other. A post hoc analysis is therefore necessary.

The first step is to adjust the Type I error rate \(\alpha = 0.05\) to control the family-wise error rate using the Bonferroni adjustment, dividing \(\alpha\) by the number of pairwise combinations of the categories of the explanatory variable.

\[{\alpha \over \binom{c}{2}} = {0.05 \over \binom{3}{2}} = {0.05 \over 3} \approx 0.0167\]
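The same arithmetic can be written as a short DATA step, which is convenient if the number of categories changes. This is a minimal sketch; the variable names alpha, n_levels, n_pairs and adjusted are illustrative and not taken from the assignment's code.

```sas
/* Sketch: Bonferroni-adjusted significance level for all pairwise    */
/* comparisons between the categories of the explanatory variable.    */
DATA _NULL_;
    alpha    = 0.05;
    n_levels = 3;                     /* Layers categories: 0, 1, 2+  */
    n_pairs  = COMB(n_levels, 2);     /* number of pairwise tests     */
    adjusted = alpha / n_pairs;
    PUT "Pairwise comparisons: " n_pairs;
    PUT "Bonferroni-adjusted alpha: " adjusted 6.4;
RUN;
```

The values are written to the SAS log rather than to the results window.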

Running the chi-square test pairwise shows that each of the three combinations (0 and 1; 0 and 2+; 1 and 2+) is distinct: every pairwise p-value is below the adjusted threshold of 0.0167.

Pairwise p-values between Layers categories

Layers   0        1        2+
0        -
1        <.0001   -
2+       <.0001   <.0001   -
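The three pairwise tests can also be run without creating a subset data set for each pair, by filtering inside PROC FREQ. The sketch below assumes the data set WORK with the Hemisphere and Layers variables built in this assignment's DATA step; the macro name PAIRWISE and its parameters a and b are hypothetical names used only for illustration.

```sas
/* Sketch: one macro call per pair of Layers categories.              */
/* PAIRWISE is an illustrative macro name; WORK is the data set       */
/* built by the DATA step in the SAS code of this assignment.         */
%MACRO PAIRWISE(a, b);
    PROC FREQ DATA=WORK;
        /* Keep only the two categories being compared */
        WHERE Layers IN ("&a", "&b");
        TABLES Hemisphere * Layers / CHISQ;
    RUN;
%MEND PAIRWISE;

/* Compare each p-value with the Bonferroni-adjusted 0.05 / 3 */
%PAIRWISE(0,1)
%PAIRWISE(0,2+)
%PAIRWISE(1,2+)
```

SAS pads the shorter character value with blanks when comparing, so "0" matches the stored value "0 ".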

Using SAS

SAS Code

/* Using SAS Educational Virtual Machine running locally */
/* TERMSTR=CRLF reads CSV files with Windows-style (CRLF) line endings; */
/* for CSV files uploaded from Unix/MacOS use TERMSTR=LF instead        */
FILENAME CSV "/folders/myfolders/marscrater_pds.csv" TERMSTR=CRLF;

PROC IMPORT DATAFILE=CSV OUT=WORK DBMS=CSV REPLACE;
RUN;

/* Unassign the file reference.  */
FILENAME CSV;

DATA WORK;
    SET WORK;

    /* Create Hemisphere */
    IF LATITUDE_CIRCLE_IMAGE < 0 THEN
        Hemisphere="South";
    ELSE
        Hemisphere="North";

    /* Reduce Number of Layers (Group 2, 3, 4 and 5 into 2+) */
    IF NUMBER_LAYERS=0 THEN
        Layers="0 ";
    ELSE IF NUMBER_LAYERS=1 THEN
        Layers="1 ";
    ELSE
        Layers="2+";
RUN;

/* Order the data by crater ID */
PROC SORT DATA=WORK; BY CRATER_ID; RUN;

/* Show the initial and grouped versions for Layers */
PROC SUMMARY PRINT; CLASS NUMBER_LAYERS; RUN;
PROC SUMMARY PRINT; CLASS Layers; RUN;

/* Calculate Chi-Squared */
PROC FREQ ;
    TABLES Hemisphere * Layers / CHISQ;
RUN;

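/* Post hoc comparisons: one subset and one chi-square test per pair  */
/* of Layers categories; compare each p-value with 0.05 / 3           */

/* Pair 1: Layers 0 and 1 */
DATA COMPARISON1; SET WORK; IF Layers="0 " OR Layers="1 "; RUN;
PROC SORT DATA=COMPARISON1; BY CRATER_ID; RUN;
PROC FREQ DATA=COMPARISON1; TABLES Hemisphere * Layers / CHISQ; RUN;

/* Pair 2: Layers 0 and 2+ */
DATA COMPARISON2; SET WORK; IF Layers="0 " OR Layers="2+"; RUN;
PROC SORT DATA=COMPARISON2; BY CRATER_ID; RUN;
PROC FREQ DATA=COMPARISON2; TABLES Hemisphere * Layers / CHISQ; RUN;

/* Pair 3: Layers 1 and 2+ */
DATA COMPARISON3; SET WORK; IF Layers="1 " OR Layers="2+"; RUN;
PROC SORT DATA=COMPARISON3; BY CRATER_ID; RUN;
PROC FREQ DATA=COMPARISON3; TABLES Hemisphere * Layers / CHISQ; RUN;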

SAS Output

Results: A02-03-SAS-CHI2-Local.sas

The SUMMARY Procedure

Summary statistics

NUMBER_LAYERS N Obs
0 364612
1 15467
2 3435
3 739
4 85
5 5

The SUMMARY Procedure

Summary statistics

Layers N Obs
0 364612
1 15467
2+ 4264

The FREQ Procedure

Table of Hemisphere by Layers
Rows: Hemisphere; columns: Layers
Cell contents, top to bottom: Frequency, Percent, Row Pct, Col Pct

Hemisphere  0         1         2+        Total
North       141226    7169      2499      150894
            36.74     1.87      0.65      39.26
            93.59     4.75      1.66
            38.73     46.35     58.61
South       223386    8298      1765      233449
            58.12     2.16      0.46      60.74
            95.69     3.55      0.76
            61.27     53.65     41.39
Total       364612    15467     4264      384343
            94.87     4.02      1.11      100.00

Statistics for Table of Hemisphere by Layers

Chi-Square Tests

Statistic                      DF   Value       Prob
Chi-Square                     2    1037.7882   <.0001
Likelihood Ratio Chi-Square    2    1011.6248   <.0001
Mantel-Haenszel Chi-Square     1    1019.5820   <.0001
Phi Coefficient                     0.0520
Contingency Coefficient             0.0519
Cramer's V                          0.0520

Sample Size = 384343
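Because the test depends only on the six cell counts, the overall result can be re-run without the original CSV file. The sketch below is an illustration under stated assumptions: the data set name CELLS and the variable Count are hypothetical names, and the counts are copied from the cross-tabulation above.

```sas
/* Sketch: rebuild the Hemisphere by Layers table from cell counts.   */
/* CELLS and Count are illustrative names; the counts are copied      */
/* from the cross-tabulation in the SAS output.                       */
DATA CELLS;
    LENGTH Hemisphere $ 5 Layers $ 2;
    INPUT Hemisphere $ Layers $ Count;
    DATALINES;
North 0 141226
North 1 7169
North 2+ 2499
South 0 223386
South 1 8298
South 2+ 1765
;
RUN;

/* WEIGHT treats each row as Count observations; EXPECTED and         */
/* CELLCHI2 print expected counts and per-cell contributions.         */
PROC FREQ DATA=CELLS;
    WEIGHT Count;
    TABLES Hemisphere * Layers / CHISQ EXPECTED CELLCHI2;
RUN;
```

If the counts are entered correctly, the chi-square statistic should agree with the value of 1037.7882 reported above, and the per-cell contributions show which cells depart most from independence.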

The FREQ Procedure

Table of Hemisphere by Layers
Rows: Hemisphere; columns: Layers
Cell contents, top to bottom: Frequency, Percent, Row Pct, Col Pct

Hemisphere  0         1         Total
North       141226    7169      148395
            37.16     1.89      39.04
            95.17     4.83
            38.73     46.35
South       223386    8298      231684
            58.77     2.18      60.96
            96.42     3.58
            61.27     53.65
Total       364612    15467     380079
            95.93     4.07      100.00

Statistics for Table of Hemisphere by Layers

Chi-Square Tests

Statistic                      DF   Value       Prob
Chi-Square                     1    361.7187    <.0001
Likelihood Ratio Chi-Square    1    355.4120    <.0001
Continuity Adj. Chi-Square     1    361.3987    <.0001
Mantel-Haenszel Chi-Square     1    361.7178    <.0001
Phi Coefficient                     -0.0308
Contingency Coefficient             0.0308
Cramer's V                          -0.0308

Fisher's Exact Test

Cell (1,1) Frequency (F)    141226
Left-sided Pr <= F          <.0001
Right-sided Pr >= F         1.0000

Table Probability (P)       <.0001
Two-sided Pr <= P           <.0001

Sample Size = 380079

The FREQ Procedure

Table of Hemisphere by Layers
Rows: Hemisphere; columns: Layers
Cell contents, top to bottom: Frequency, Percent, Row Pct, Col Pct

Hemisphere  0         2+        Total
North       141226    2499      143725
            38.29     0.68      38.96
            98.26     1.74
            38.73     58.61
South       223386    1765      225151
            60.56     0.48      61.04
            99.22     0.78
            61.27     41.39
Total       364612    4264      368876
            98.84     1.16      100.00

Statistics for Table of Hemisphere by Layers

Chi-Square Tests

Statistic                      DF   Value       Prob
Chi-Square                     1    699.9715    <.0001
Likelihood Ratio Chi-Square    1    677.5512    <.0001
Continuity Adj. Chi-Square     1    699.1361    <.0001
Mantel-Haenszel Chi-Square     1    699.9697    <.0001
Phi Coefficient                     -0.0436
Contingency Coefficient             0.0435
Cramer's V                          -0.0436

Fisher's Exact Test

Cell (1,1) Frequency (F)    141226
Left-sided Pr <= F          <.0001
Right-sided Pr >= F         1.0000

Table Probability (P)       <.0001
Two-sided Pr <= P           <.0001

Sample Size = 368876

The FREQ Procedure

Table of Hemisphere by Layers
Rows: Hemisphere; columns: Layers
Cell contents, top to bottom: Frequency, Percent, Row Pct, Col Pct

Hemisphere  1         2+        Total
North       7169      2499      9668
            36.33     12.67     49.00
            74.15     25.85
            46.35     58.61
South       8298      1765      10063
            42.06     8.95      51.00
            82.46     17.54
            53.65     41.39
Total       15467     4264      19731
            78.39     21.61     100.00

Statistics for Table of Hemisphere by Layers

Chi-Square Tests

Statistic                      DF   Value       Prob
Chi-Square                     1    200.9332    <.0001
Likelihood Ratio Chi-Square    1    201.5570    <.0001
Continuity Adj. Chi-Square     1    200.4430    <.0001
Mantel-Haenszel Chi-Square     1    200.9230    <.0001
Phi Coefficient                     -0.1009
Contingency Coefficient             0.1004
Cramer's V                          -0.1009

Fisher's Exact Test

Cell (1,1) Frequency (F)    7169
Left-sided Pr <= F          <.0001
Right-sided Pr >= F         1.0000

Table Probability (P)       <.0001
Two-sided Pr <= P           <.0001

Sample Size = 19731