Sample Selection Method
Selection Method | Simple Random Sampling |
---|---|
Specialisation | Data Analysis and Interpretation |
Course | Regression Modeling in Practice |
Education Institution | Wesleyan University |
Publisher | Coursera |
Assignment | Test a Basic Linear Regression Model |
Beyond the long-standing fascination with Mars, owing to its proximity and its similarities with Earth, each new set of facts gives scientists another piece of a large jigsaw puzzle.
NASA's recent announcement of evidence of liquid water flowing on the surface of Mars adds to what is already known and to the curiosity, or even the need, to find out much more.
With all the talk about Mars in both science and fiction (Ridley Scott’s film The Martian, based on Andy Weir's book of the same name, was released in early October 2015), the data set chosen for this assignment is the Mars Craters data set.
The Mars Craters Study presents a global database of 378,540 Mars craters with a diameter of 1 km or larger, created between 4.2 and 3.8 billion years ago during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).
The data set was made available by Wesleyan University/Coursera as part of the Data Analysis and Interpretation Specialisation. It comes from the Ph.D. thesis Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database (2011) by Robbins, S.J., University of Colorado at Boulder.
The data set provides a catalogue of craters on Mars. The initial idea is to look for patterns that could identify specific major events with a significant impact on Mars’ geology, climate and life as a planetary body.
As the initial data set has only nine variables, they could all be relevant to formulating hypotheses and reaching a conclusion, so all the variables will be kept for this assignment.
In the original full database:
Diameter (DIAM_CIRCLE_IMAGE) is a quantitative variable that ranges from 1.00 to 1,164.22 km. It will be used as the dependent variable in this test.
Depth (DEPTH_RIMFLOOR_TOPOG) is a quantitative variable that ranges from -0.42 to 4.95 km. It will be used as the explanatory variable in this test.
A sample of about ten percent of the imported data set, 38,435 cases, will be used for the regression. The sampling weight of 9.999818 reported by SURVEYSELECT implies about 384,343 imported cases, rather than the 378,540 quoted above.
Two cases could be considered outliers on the basis of their values, even though the data are correct. These two cases do not considerably alter the regression model: the slope is 22.76 with them and 22.83 without them.
In both scenarios, with and without the outliers, the p-value is significantly lower than the alpha value of five percent: \(\alpha = 0.05\) and \(p{-}value < 0.0001\).
As per the regression model, for every extra km in Depth there would be an increase of 22.76 km in Diameter, considering the Sample Data.
This indicates a significant and positive association between Diameter and Depth.
\(Diameter = 3.61806635 + 22.76468952 \times Depth\) (Sample Data)
\(Diameter = 3.57347150 + 22.83346222 \times Depth\) (Outliers removed)
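When reading the two equations, note that in the SAS program at the end of this post Depth is the rim-to-floor depth centred on its sample mean, so the intercept is the predicted diameter of a crater of average depth. The short Python sketch below evaluates both equations for a raw depth. The function name and the 1 km example depth are illustrative assumptions; the coefficients and centring constants are copied from this post.

```python
# Coefficients and centring constants copied from the equations and the SAS program.
MODELS = {
    "sample": {
        "intercept": 3.61806635,
        "slope": 22.76468952,
        "mean_depth": 0.0756716534408713,
    },
    "outliers_removed": {
        "intercept": 3.57347150,
        "slope": 22.83346222,
        "mean_depth": 0.075675591288738414,
    },
}


def predict_diameter(raw_depth_km, model):
    """Predicted diameter (km) for a raw, uncentred rim-to-floor depth (km)."""
    centred_depth = raw_depth_km - model["mean_depth"]
    return model["intercept"] + model["slope"] * centred_depth


for name, model in MODELS.items():
    at_mean = predict_diameter(model["mean_depth"], model)
    at_one_km = predict_diameter(1.0, model)
    print(f"{name}: mean depth -> {at_mean:.2f} km, 1 km deep -> {at_one_km:.2f} km")
```

Because the predictor is centred, the prediction at the mean depth is simply the intercept, which is why the intercept matches the mean diameter reported in the summary tables.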
The SURVEYSELECT Procedure
Selection Method | Simple Random Sampling |
---|---|
Input Data Set | WORK |
Random Number Seed | 196587 |
Sampling Rate | 0.1 |
Sample Size | 38435 |
Selection Probability | 0.100002 |
Sampling Weight | 9.999818 |
Output Data Set | EXTRACT |
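The figures in the SURVEYSELECT table are linked by simple arithmetic: the selection probability is the sample size divided by the number of input cases, and the sampling weight is its reciprocal. A minimal Python check, using only numbers printed in the table (the variable names are illustrative), recovers the size of the input data set:

```python
sample_size = 38435
sampling_weight = 9.999818       # as printed, rounded to six decimals
selection_probability = 0.100002  # as printed

# Each sampled case stands in for `sampling_weight` cases of the input data set.
implied_input_size = round(sample_size * sampling_weight)
print("implied input size:", implied_input_size)

# The selection probability is the reciprocal of the sampling weight,
# and also the sample size divided by the input size.
print("1 / weight:", round(1 / sampling_weight, 6))
print("n / N:", round(sample_size / implied_input_size, 6))
```

On these figures the input held about 384,343 cases, which is slightly more than the 378,540 craters quoted in the study description.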
The SUMMARY Procedure
Variable | Label | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|---|
DEPTH_RIMFLOOR_TOPOG | Depth | 38435 | 0.0756717 | 0.2217400 | -0.0100000 | 3.6000000 |
Depth | | 38435 | -3.46074E-16 | 0.2217400 | -0.0856717 | 3.5243283 |
DIAM_CIRCLE_IMAGE | Diameter | 38435 | 3.6180663 | 10.3805219 | 1.0000000 | 1096.65 |
The GLM Procedure
Number of Observations Read | 38435 |
---|---|
Number of Observations Used | 38435 |
The GLM Procedure
Dependent Variable: DIAM_CIRCLE_IMAGE Diameter
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 1 | 979325.302 | 979325.302 | 11902.8 | <.0001 |
Error | 38433 | 3162139.359 | 82.277 | | |
Corrected Total | 38434 | 4141464.661 | | | |
R-Square | Coeff Var | Root MSE | DIAM_CIRCLE_IMAGE Mean |
---|---|---|---|
0.236468 | 250.7043 | 9.070649 | 3.618066 |
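The fit statistics in this table follow directly from the sums of squares in the ANOVA table above it. As a sketch, the Python lines below recompute them from the printed values; the variable names are illustrative, and small differences in the last digit can come from rounding in the printed output.

```python
import math

# Values copied from the ANOVA table for the sample data.
ss_model, df_model = 979325.302, 1
ss_error, df_error = 3162139.359, 38433
ss_total = 4141464.661
mean_diameter = 3.618066

ms_model = ss_model / df_model
ms_error = ss_error / df_error

r_square = ss_model / ss_total              # share of the variance explained
f_value = ms_model / ms_error               # F statistic for the model
root_mse = math.sqrt(ms_error)              # residual standard deviation
coeff_var = 100 * root_mse / mean_diameter  # Root MSE as a percentage of the mean

print(f"R-Square={r_square:.6f} F={f_value:.1f} "
      f"Root MSE={root_mse:.6f} Coeff Var={coeff_var:.4f}")
```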
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Depth | 1 | 979325.3017 | 979325.3017 | 11902.8 | <.0001 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Depth | 1 | 979325.3017 | 979325.3017 | 11902.8 | <.0001 |
Parameter | Estimate | Standard Error | t Value | Pr > |t| |
---|---|---|---|---|
Intercept | 3.61806635 | 0.04626738 | 78.20 | <.0001 |
Depth | 22.76468952 | 0.20865875 | 109.10 | <.0001 |
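Each t value in this table is the estimate divided by its standard error. The sketch below recomputes them in Python and adds an approximate 95% interval for each estimate. The interval is not part of the SAS output; it is an illustration that assumes the normal quantile 1.96, which is reasonable with this many error degrees of freedom.

```python
# Estimates and standard errors copied from the parameter table (sample data).
params = {
    "Intercept": (3.61806635, 0.04626738),
    "Depth": (22.76468952, 0.20865875),
}
Z_95 = 1.96  # assumption: normal approximation to the t distribution

results = {}
for name, (estimate, std_error) in params.items():
    t_value = estimate / std_error
    interval = (estimate - Z_95 * std_error, estimate + Z_95 * std_error)
    results[name] = (t_value, interval)
    print(f"{name}: t={t_value:.2f}, "
          f"approx. 95% interval=({interval[0]:.3f}, {interval[1]:.3f})")

# With a single predictor, the squared slope t value equals the model F value.
t_depth_squared = results["Depth"][0] ** 2
print(f"t^2 for Depth = {t_depth_squared:.1f}")
```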
The SUMMARY Procedure
Variable | Label | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|---|
DEPTH_RIMFLOOR_TOPOG | Depth | 38433 | 0.0756756 | 0.2217451 | -0.0100000 | 3.6000000 |
Depth | | 38433 | -6.21191E-17 | 0.2217451 | -0.0856756 | 3.5243244 |
DIAM_CIRCLE_IMAGE | Diameter | 38433 | 3.5734715 | 8.1634387 | 1.0000000 | 326.7700000 |
The GLM Procedure
Number of Observations Read | 38433 |
---|---|
Number of Observations Used | 38433 |
The GLM Procedure
Dependent Variable: DIAM_CIRCLE_IMAGE Diameter
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 1 | 985245.401 | 985245.401 | 24026.4 | <.0001 |
Error | 38431 | 1575929.644 | 41.007 | | |
Corrected Total | 38432 | 2561175.045 | | | |
R-Square | Coeff Var | Root MSE | DIAM_CIRCLE_IMAGE Mean |
---|---|---|---|
0.384685 | 179.1997 | 6.403650 | 3.573471 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Depth | 1 | 985245.4011 | 985245.4011 | 24026.4 | <.0001 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Depth | 1 | 985245.4011 | 985245.4011 | 24026.4 | <.0001 |
Parameter | Estimate | Standard Error | t Value | Pr > |t| |
---|---|---|---|---|
Intercept | 3.57347150 | 0.03266446 | 109.40 | <.0001 |
Depth | 22.83346222 | 0.14730827 | 155.00 | <.0001 |
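To see how much the two removed cases matter, the two fits can be compared side by side. The Python sketch below computes the relative change of each figure; all numbers are copied from the two GLM outputs in this post, and the helper name is illustrative.

```python
# Figures copied from the two GLM outputs in this post.
sample = {"intercept": 3.61806635, "slope": 22.76468952,
          "r_square": 0.236468, "root_mse": 9.070649}
trimmed = {"intercept": 3.57347150, "slope": 22.83346222,
           "r_square": 0.384685, "root_mse": 6.403650}


def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100 * (after - before) / before


changes = {key: pct_change(sample[key], trimmed[key]) for key in sample}
for key, value in changes.items():
    print(f"{key}: {value:+.1f}%")
```

On these figures the coefficients move by roughly one percent or less, while R-Square and Root MSE shift much more. This suggests the two cases affect the goodness of fit more than the fitted line itself.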
/* Using SAS Educational Virtual Machine running locally */
/* For CSV Files uploaded from MacOS */
FILENAME CSV "/folders/myfolders/marscrater_pds.csv"
TERMSTR = CRLF;
PROC IMPORT DATAFILE = CSV
OUT = WORK
DBMS = CSV
REPLACE;
RUN;
/* Unassign the file reference. */
FILENAME CSV;
DATA WORK;
SET WORK;
LABEL DEPTH_RIMFLOOR_TOPOG = "Depth" /* Explanatory - Quantitative */
DIAM_CIRCLE_IMAGE = "Diameter"; /* Response - Quantitative */
RUN;
/* Select a sample of 10% of the population */
PROC SURVEYSELECT DATA = WORK
OUT = EXTRACT
METHOD = SRS
SAMPRATE = 0.1
SEED = 196587;
ID DIAM_CIRCLE_IMAGE
DEPTH_RIMFLOOR_TOPOG;
RUN;
DATA SAMPLE ;
SET EXTRACT;
Depth = DEPTH_RIMFLOOR_TOPOG - 0.0756716534408713;
RUN;
PROC SUMMARY DATA = SAMPLE PRINT ;
VAR DEPTH_RIMFLOOR_TOPOG
Depth
DIAM_CIRCLE_IMAGE;
TITLE "Mars' Craters - Summary - Sample Data";
RUN;
PROC SGPLOT DATA = SAMPLE ;
REG X = Depth
Y = DIAM_CIRCLE_IMAGE
/ FILLEDOUTLINEDMARKERS
MARKERATTRS = (COLOR = GREY SIZE = 5 SYMBOL = CIRCLEFILLED)
MARKERFILLATTRS = (COLOR = GREY)
MARKEROUTLINEATTRS = (COLOR = GREY THICKNESS = 1)
LINEATTRS = (COLOR = RED THICKNESS = 2);
TITLE "Mars' Craters - Diameter by Depth - Sample Data";
YAXIS LABEL = "Diameter";
XAXIS LABEL = "Depth";
RUN;
PROC GLM DATA = SAMPLE ;
MODEL DIAM_CIRCLE_IMAGE = Depth;
TITLE "Mars' Craters - GLM - Sample Data";
RUN;
DATA NO_OUTLIERS ;
SET SAMPLE;
WHERE DIAM_CIRCLE_IMAGE < 500 AND
DEPTH_RIMFLOOR_TOPOG < 4 ;
Depth = DEPTH_RIMFLOOR_TOPOG - 0.075675591288738414;
RUN;
PROC SUMMARY DATA = NO_OUTLIERS PRINT ;
VAR DEPTH_RIMFLOOR_TOPOG
Depth
DIAM_CIRCLE_IMAGE;
TITLE "Mars' Craters - Summary - Outliers removed";
RUN;
PROC SGPLOT DATA = NO_OUTLIERS;
REG X = Depth
Y = DIAM_CIRCLE_IMAGE
/ FILLEDOUTLINEDMARKERS
MARKERATTRS = (COLOR = GREY SIZE = 5 SYMBOL = CIRCLEFILLED)
MARKERFILLATTRS = (COLOR = GREY)
MARKEROUTLINEATTRS = (COLOR = GREY THICKNESS = 1)
LINEATTRS = (COLOR = RED THICKNESS = 2);
TITLE "Mars' Craters - Diameter by Depth - Outliers removed";
YAXIS LABEL = "Diameter";
XAXIS LABEL = "Depth";
RUN;
PROC GLM DATA = NO_OUTLIERS;
MODEL DIAM_CIRCLE_IMAGE = Depth;
TITLE "Mars' Craters - GLM - Outliers removed";
RUN;
TITLE;
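For readers without SAS, the same steps (simple random sampling, centring the explanatory variable, fitting a one-predictor regression) can be sketched in plain Python. This is not a port of the program above: the population is synthetic and generated only for illustration, and Python's random generator differs from SAS, so reusing the seed value does not reproduce the sample in this post.

```python
import random
import statistics

random.seed(196587)  # same seed value as the SAS program, different generator

# Hypothetical population of (depth, diameter) pairs, for illustration only.
population = []
for _ in range(10_000):
    depth = random.uniform(0.0, 2.0)
    diameter = 2.0 + 20.0 * depth + random.gauss(0.0, 5.0)
    population.append((depth, diameter))

# Simple random sample without replacement at a 10% rate.
sample = random.sample(population, k=len(population) // 10)

# Centre the explanatory variable on the sample mean.
mean_depth = statistics.fmean(d for d, _ in sample)
x = [d - mean_depth for d, _ in sample]
y = [diam for _, diam in sample]

# Closed-form least squares for a single predictor.
mean_x = statistics.fmean(x)
mean_y = statistics.fmean(y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"slope={slope:.3f} intercept={intercept:.3f} mean diameter={mean_y:.3f}")
```

Computing the mean inside the script, rather than pasting it in as a constant, keeps the centring correct if the sample changes. With a centred predictor the intercept coincides with the mean of the response, as it does in both GLM outputs above.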