Specialisation | Data Analysis and Interpretation |
Course | Data Analysis Tools |
Education Institution | Wesleyan University |
Publisher | Coursera |
Assignment | Running a Chi-Square Test of Independence |
Beyond the long-standing fascination with Mars, driven by its proximity and its similarities to Earth, each new fact collected about the planet helps scientists piece together a big jigsaw puzzle.
NASA's recent announcement of evidence of liquid water flowing on the surface of Mars adds to what is known and to the curiosity, or even the need, to find out much more.
With all the talk about Mars in both science and fiction (Ridley Scott’s film The Martian, based on Andy Weir's book of the same name, opens in early Oct/15), the data set chosen for this assignment is Mars Craters.
The Mars Craters study presents a global database of over 300,000 Martian craters 1 km or larger that were created between 4.2 and 3.8 billion years ago during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).
The data set was made available by Wesleyan University/Coursera as part of the Data Management and Visualisation course of the Data Analysis and Interpretation Specialisation. It comes from the Ph.D. thesis Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database (2011) by Robbins, S.J., University of Colorado at Boulder.
The data set provides a catalogue of craters on Mars. The initial idea is to look for patterns that could point to specific major events with a significant impact on Mars’ geology, climate and life as a planetary body.
As the initial data set has only nine variables, they could all be relevant to formulating hypotheses and reaching a conclusion, so all the variables will be kept for this assignment.
Is the number of layers of a crater associated with the hemisphere where it is located?
Because the counts fall off sharply as the number of layers increases, the variable NUMBER_LAYERS will be reclassified as the variable Layers, grouping craters with 2 or more layers into a single category, 2+.
The Null Hypothesis (\(H_0\)) will consider that there is no association between number of layers of the craters and the hemisphere where they are located.
The Alternative Hypothesis (\(H_1\)) will consider that there is an association between number of layers of the craters and the hemisphere where they are located.
A chi-square test of independence between the number of layers of the craters (categorical explanatory variable) and the hemisphere (categorical response variable) shows that the two variables are associated: the proportions of craters with 0, 1 or 2+ layers differ between the hemispheres.
The association is determined by the resulting statistics: Chi-Squared of \(\chi^2 = 1037.7882\), degrees of freedom as \(df = 2\) and a probability of \({p{-}value} < .0001\).
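For reference, the Pearson chi-square statistic compares the observed cell counts \(O_{ij}\) with the counts \(E_{ij}\) expected under \(H_0\). As a rough consistency check, using only the values reported in the SAS output further down (\(n = 384343\), two hemispheres, three levels of Layers), Cramér's V can be recovered from \(\chi^2\):

\[\chi^2 = \sum_{i,j} {(O_{ij} - E_{ij})^2 \over E_{ij}}, \qquad E_{ij} = {(\text{row}_i\ \text{total}) \times (\text{column}_j\ \text{total}) \over n}\]

\[V = \sqrt{\chi^2 \over n \cdot \min(r - 1, c - 1)} = \sqrt{1037.7882 \over 384343 \times 1} \approx 0.0520\]

This agrees with the Cramer's V printed by PROC FREQ and suggests that the association, although statistically significant in such a large sample, is weak in strength.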
The degrees of freedom (df) are \((r - 1)(c - 1)\), where \(r\) and \(c\) are the numbers of levels of the two variables. With two hemispheres and three levels of Layers, \(df = (2 - 1)(3 - 1) = 2\).
Because the null hypothesis was rejected (small p-value) and the explanatory variable has more than two categories, the overall test does not reveal which pairs of categories differ; a post hoc analysis is therefore necessary.
The first step is to adjust the Type I error rate \(\alpha = 0.05\) to control the family-wise error rate using the Bonferroni adjustment, dividing the threshold by the number of pairwise combinations of the categories of the explanatory variable.
\[{\alpha \over \binom{c}{2}} = {0.05 \over \binom{3}{2}} = {0.05 \over 3} = 0.0167\]
Running the chi-square test pairwise shows that each of the three combinations (0 and 1; 0 and 2+; 1 and 2+) differs significantly: every p-value is below the adjusted threshold of 0.0167.
Layers | 0 | 1 | 2+ |
---|---|---|---|
0 | - | | |
1 | <.0001 | - | |
2+ | <.0001 | <.0001 | - |
/* Using SAS Educational Virtual Machine running locally */
/* TERMSTR=CRLF matches Windows line endings; for CSV files uploaded from Unix/MacOS use TERMSTR=LF */
FILENAME CSV "/folders/myfolders/marscrater_pds.csv" TERMSTR=CRLF;
PROC IMPORT DATAFILE=CSV OUT=WORK DBMS=CSV REPLACE;
RUN;
/* Unassign the file reference. */
FILENAME CSV;
DATA WORK;
SET WORK;
/* Create Hemisphere */
IF LATITUDE_CIRCLE_IMAGE < 0 THEN
Hemisphere="South";
ELSE
Hemisphere="North";
/* Reduce Number of Layers (Group 2, 3, 4 and 5 into 2+) */
IF NUMBER_LAYERS=0 THEN
Layers="0 ";
ELSE IF NUMBER_LAYERS=1 THEN
Layers="1 ";
ELSE
Layers="2+";
RUN;
/* Order the data by crater ID */
PROC SORT DATA=WORK; BY CRATER_ID; RUN;
/* Show the initial and grouped versions for Layers */
PROC SUMMARY DATA=WORK PRINT; CLASS NUMBER_LAYERS; RUN;
PROC SUMMARY DATA=WORK PRINT; CLASS Layers; RUN;
/* Calculate Chi-Squared for the full 2 x 3 table */
PROC FREQ DATA=WORK;
TABLES Hemisphere * Layers / CHISQ;
RUN;
/* Post hoc comparison 1: Layers 0 vs 1 */
DATA COMPARISON1; SET WORK; IF Layers="0 " OR Layers="1 "; RUN;
PROC SORT DATA=COMPARISON1; BY CRATER_ID; RUN;
PROC FREQ DATA=COMPARISON1; TABLES Hemisphere * Layers / CHISQ; RUN;
/* Post hoc comparison 2: Layers 0 vs 2+ */
DATA COMPARISON2; SET WORK; IF Layers="0 " OR Layers="2+"; RUN;
PROC SORT DATA=COMPARISON2; BY CRATER_ID; RUN;
PROC FREQ DATA=COMPARISON2; TABLES Hemisphere * Layers / CHISQ; RUN;
/* Post hoc comparison 3: Layers 1 vs 2+ */
DATA COMPARISON3; SET WORK; IF Layers="1 " OR Layers="2+"; RUN;
PROC SORT DATA=COMPARISON3; BY CRATER_ID; RUN;
PROC FREQ DATA=COMPARISON3; TABLES Hemisphere * Layers / CHISQ; RUN;
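As an alternative to creating three intermediate data sets, the same pairwise comparisons could be sketched with a WHERE statement inside PROC FREQ. This is only an illustration, not the code that produced the output below: it assumes the data set WORK and the variables Hemisphere and Layers created above, and the TITLE texts are made up for readability.

```sas
/* Sketch: pairwise post hoc tests without intermediate data sets.   */
/* Assumes WORK, Hemisphere and Layers exist as created earlier.     */
TITLE "Post hoc: Layers 0 vs 1";
PROC FREQ DATA=WORK;
  WHERE Layers IN ("0 ", "1 ");   /* keep only the two levels compared */
  TABLES Hemisphere * Layers / CHISQ;
RUN;

TITLE "Post hoc: Layers 0 vs 2+";
PROC FREQ DATA=WORK;
  WHERE Layers IN ("0 ", "2+");
  TABLES Hemisphere * Layers / CHISQ;
RUN;

TITLE "Post hoc: Layers 1 vs 2+";
PROC FREQ DATA=WORK;
  WHERE Layers IN ("1 ", "2+");
  TABLES Hemisphere * Layers / CHISQ;
RUN;

TITLE;   /* clear the title for later steps */
```

Each resulting p-value would then be compared with the Bonferroni-adjusted threshold of 0.0167 rather than with 0.05.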
The SUMMARY Procedure
NUMBER_LAYERS | N Obs |
---|---|
0 | 364612 |
1 | 15467 |
2 | 3435 |
3 | 739 |
4 | 85 |
5 | 5 |
The SUMMARY Procedure
Layers | N Obs |
---|---|
0 | 364612 |
1 | 15467 |
2+ | 4264 |
The FREQ Procedure
Statistics for Table of Hemisphere by Layers
Statistic | DF | Value | Prob |
---|---|---|---|
Chi-Square | 2 | 1037.7882 | <.0001 |
Likelihood Ratio Chi-Square | 2 | 1011.6248 | <.0001 |
Mantel-Haenszel Chi-Square | 1 | 1019.5820 | <.0001 |
Phi Coefficient | | 0.0520 | |
Contingency Coefficient | | 0.0519 | |
Cramer's V | | 0.0520 | |
Sample Size = 384343
The FREQ Procedure
Statistics for Table of Hemisphere by Layers
Statistic | DF | Value | Prob |
---|---|---|---|
Chi-Square | 1 | 361.7187 | <.0001 |
Likelihood Ratio Chi-Square | 1 | 355.4120 | <.0001 |
Continuity Adj. Chi-Square | 1 | 361.3987 | <.0001 |
Mantel-Haenszel Chi-Square | 1 | 361.7178 | <.0001 |
Phi Coefficient | | -0.0308 | |
Contingency Coefficient | | 0.0308 | |
Cramer's V | | -0.0308 | |
Fisher's Exact Test | |
---|---|
Cell (1,1) Frequency (F) | 141226 |
Left-sided Pr <= F | <.0001 |
Right-sided Pr >= F | 1.0000 |
Table Probability (P) | <.0001 |
Two-sided Pr <= P | <.0001 |
Sample Size = 380079
The FREQ Procedure
Statistics for Table of Hemisphere by Layers
Statistic | DF | Value | Prob |
---|---|---|---|
Chi-Square | 1 | 699.9715 | <.0001 |
Likelihood Ratio Chi-Square | 1 | 677.5512 | <.0001 |
Continuity Adj. Chi-Square | 1 | 699.1361 | <.0001 |
Mantel-Haenszel Chi-Square | 1 | 699.9697 | <.0001 |
Phi Coefficient | | -0.0436 | |
Contingency Coefficient | | 0.0435 | |
Cramer's V | | -0.0436 | |
Fisher's Exact Test | |
---|---|
Cell (1,1) Frequency (F) | 141226 |
Left-sided Pr <= F | <.0001 |
Right-sided Pr >= F | 1.0000 |
Table Probability (P) | <.0001 |
Two-sided Pr <= P | <.0001 |
Sample Size = 368876
The FREQ Procedure
Statistics for Table of Hemisphere by Layers
Statistic | DF | Value | Prob |
---|---|---|---|
Chi-Square | 1 | 200.9332 | <.0001 |
Likelihood Ratio Chi-Square | 1 | 201.5570 | <.0001 |
Continuity Adj. Chi-Square | 1 | 200.4430 | <.0001 |
Mantel-Haenszel Chi-Square | 1 | 200.9230 | <.0001 |
Phi Coefficient | | -0.1009 | |
Contingency Coefficient | | 0.1004 | |
Cramer's V | | -0.1009 | |
Fisher's Exact Test | |
---|---|
Cell (1,1) Frequency (F) | 7169 |
Left-sided Pr <= F | <.0001 |
Right-sided Pr >= F | 1.0000 |
Table Probability (P) | <.0001 |
Two-sided Pr <= P | <.0001 |
Sample Size = 19731