Specialisation | Data Analysis and Interpretation |
Course | Data Analysis Tools |
Education Institution | Wesleyan University |
Publisher | Coursera |
Assignment | Generating a Correlation Coefficient |
Besides the long-standing fascination with Mars, owing to its proximity and some similarities with Earth, each newly collected fact helps scientists piece together a large jigsaw puzzle.
NASA's recent announcement of evidence of flowing liquid water on the surface of Mars adds to what is known and to the curiosity, or even the need, to find out much more.
With all the talk about Mars in both science and fiction (Ridley Scott’s film The Martian, based on Andy Weir’s book of the same name, opens in early October 2015), the data set chosen for this assignment is the Mars Craters data set.
The Mars Craters Study presents a global database of over 300,000 Mars craters 1 km or larger that were created between 4.2 and 3.8 billion years ago during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).
The data set was made available by Wesleyan University/Coursera as part of the Data Management and Visualisation course of the Data Analysis and Interpretation Specialisation. It comes from the Ph.D. thesis Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database (2011) by Robbins, S.J., University of Colorado at Boulder.
The data set provides a catalogue of craters on Mars. The initial idea is to check for patterns that could identify specific major events that might have had a significant impact on Mars’ geology, climate and life as a planetary body.
As the initial data set has only nine variables, they could all be relevant to formulating hypotheses and reaching a conclusion, so all the variables will be kept for this assignment.
Is the Diameter of craters associated with their Depth?
The correlation analysis using Pearson’s coefficient on the full data set gives a statistically significant result (\({p{-}value} < 0.0001\)). The Pearson coefficient indicates a positive, moderate linear correlation (\(r = 0.5867\)). Diameter explains about 34% of the variability in Depth (\(r^2 = 0.3442\)).
To make the visualisation easier to read, a second analysis was run after eliminating craters with Diameter equal to or greater than 400 km or Depth equal to or greater than 4 km. This eliminated 13 observations out of 384,343 (about 0.0034%).
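The size of this reduction can be checked by hand, and the remaining count matches the N that PROC CORR reports for the reduced data set:

\[
\frac{13}{384{,}343} \approx 3.4 \times 10^{-5} \approx 0.0034\%, \qquad 384{,}343 - 13 = 384{,}330
\]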
The correlation analysis using Pearson’s coefficient on the reduced data set also gives a statistically significant result (\({p{-}value} < 0.0001\)). The Pearson coefficient indicates a positive, moderate linear correlation (\(r = 0.6301\)). Diameter explains about 40% of the variability in Depth (\(r^2 = 0.3970\)).
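As a quick check on the quoted percentages, the coefficient of determination is the square of the Pearson coefficient. Using the five-decimal values reported by PROC CORR (the subscripts "full" and "reduced" are labels introduced here for the two data sets):

\[
r^2_{\text{full}} = 0.58671^2 \approx 0.3442, \qquad r^2_{\text{reduced}} = 0.63006^2 \approx 0.3970
\]

Removing the 13 extreme craters therefore raises the share of Depth variability associated with Diameter by roughly five percentage points.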
/* Using SAS Educational Virtual Machine running locally */
/* TERMSTR=CRLF reads a CSV with Windows (CRLF) line endings; */
/* a file saved on Unix/MacOS would need TERMSTR=LF instead.  */
FILENAME CSV "/folders/myfolders/marscrater_pds.csv" TERMSTR=CRLF;
PROC IMPORT DATAFILE=CSV OUT=WORK DBMS=CSV REPLACE;
RUN;
/* Unassign the file reference. */
FILENAME CSV;
DATA WORK;
SET WORK;
LABEL DIAM_CIRCLE_IMAGE = "Diameter"
DEPTH_RIMFLOOR_TOPOG = "Depth";
RUN;
/* Order the data by number of layers, the grouping variable used in the plots */
PROC SORT ;
BY NUMBER_LAYERS;
RUN;
PROC SGPLOT ;
TITLE "Mars' Craters - Depth by Diameter";
SCATTER X = DIAM_CIRCLE_IMAGE
Y = DEPTH_RIMFLOOR_TOPOG
/ GROUP = NUMBER_LAYERS;
RUN;
PROC CORR ;
VAR DIAM_CIRCLE_IMAGE
DEPTH_RIMFLOOR_TOPOG;
RUN;
DATA NO_OUTLIERS ;
SET WORK;
WHERE DIAM_CIRCLE_IMAGE < 400 AND
DEPTH_RIMFLOOR_TOPOG < 4 ;
RUN;
PROC SGPLOT DATA = NO_OUTLIERS ;
TITLE "Mars' Craters - Depth by Diameter - Outliers removed";
SCATTER X = DIAM_CIRCLE_IMAGE
Y = DEPTH_RIMFLOOR_TOPOG
/ GROUP = NUMBER_LAYERS;
RUN;
PROC CORR DATA = NO_OUTLIERS ;
VAR DIAM_CIRCLE_IMAGE
DEPTH_RIMFLOOR_TOPOG;
RUN;
The CORR Procedure

2 Variables: DIAM_CIRCLE_IMAGE DEPTH_RIMFLOOR_TOPOG

Simple Statistics

Variable | N | Mean | Std Dev | Sum | Minimum | Maximum | Label |
---|---|---|---|---|---|---|---|
DIAM_CIRCLE_IMAGE | 384343 | 3.55669 | 8.59199 | 1366988 | 1.00000 | 1164 | Diameter |
DEPTH_RIMFLOOR_TOPOG | 384343 | 0.07584 | 0.22152 | 29148 | -0.42000 | 4.95000 | Depth |

Pearson Correlation Coefficients, N = 384343
Prob > |r| under H0: Rho=0 (shown in parentheses)

Variable | DIAM_CIRCLE_IMAGE | DEPTH_RIMFLOOR_TOPOG |
---|---|---|
DIAM_CIRCLE_IMAGE (Diameter) | 1.00000 | 0.58671 (<.0001) |
DEPTH_RIMFLOOR_TOPOG (Depth) | 0.58671 (<.0001) | 1.00000 |
The CORR Procedure

2 Variables: DIAM_CIRCLE_IMAGE DEPTH_RIMFLOOR_TOPOG

Simple Statistics

Variable | N | Mean | Std Dev | Sum | Minimum | Maximum | Label |
---|---|---|---|---|---|---|---|
DIAM_CIRCLE_IMAGE | 384330 | 3.54149 | 7.94177 | 1361102 | 1.00000 | 376.35000 | Diameter |
DEPTH_RIMFLOOR_TOPOG | 384330 | 0.07577 | 0.22099 | 29122 | -0.42000 | 3.80000 | Depth |

Pearson Correlation Coefficients, N = 384330
Prob > |r| under H0: Rho=0 (shown in parentheses)

Variable | DIAM_CIRCLE_IMAGE | DEPTH_RIMFLOOR_TOPOG |
---|---|---|
DIAM_CIRCLE_IMAGE (Diameter) | 1.00000 | 0.63006 (<.0001) |
DEPTH_RIMFLOOR_TOPOG (Depth) | 0.63006 (<.0001) | 1.00000 |