Sample Selection Method
Selection Method | Simple Random Sampling |
---|---|
Specialisation | Data Analysis and Interpretation |
Course | Regression Modeling in Practice |
Education Institution | Wesleyan University |
Publisher | Coursera |
Assignment | Test a Multiple Regression Model |
Besides the historical fascination with Mars, due to its proximity and some similarities with Earth, the collection of facts allows scientists to put together a big jigsaw puzzle.
A recent NASA announcement of evidence of liquid water flowing on the surface of Mars adds to what is already known and to the curiosity, or even the need, to find out much more.
Although statistically significant, the regression models tested with single and multiple variables explain only about 56% to 58% of the variability in Diameter when using the craters’ Depth and Layers as explanatory variables. Neither using both variables nor adding a quadratic term improves significantly on the base Diameter by Depth association.
With all the talk about Mars, in both science and fiction (Ridley Scott’s movie The Martian, based on the book of the same name by Andy Weir, was released in early October 2015), the data set chosen for this assignment is the Mars Craters.
The Mars Craters Study presents a global database that includes 378,540 Mars craters, with a diameter of 1 km or larger, that were created between 4.2 and 3.8 billion years ago during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).
The data set was made available by Wesleyan University/Coursera as part of the Data Analysis and Interpretation Specialisation, from the Ph.D. Thesis Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database (2011) by Robbins, S.J., University of Colorado at Boulder.
The data set provides a catalogue of craters on Mars. The initial thoughts are about checking for patterns that could identify specific major events that might have happened and that would have significant impact on Mars’ geology, climate and life as a planetary body.
As the initial data set has only nine variables, they could all be relevant in formulating hypotheses and in reaching a conclusion, so all the variables will be kept for this assignment.
In the original full database:
Diameter (DIAM_CIRCLE_IMAGE) is a quantitative variable that ranges from 1.00 to 1,164.22. It will be used as the dependent variable in this test.
Depth (DEPTH_RIMFLOOR_TOPOG) is a quantitative variable that ranges from -0.42 to 4.95. It will be used as an explanatory variable in this test. As zero is a valid value in the range, it is not necessary to centre it on the mean.
Layers (NUMBER_LAYERS) is a quantitative discrete variable that ranges from 0 to 5. It will be used as an explanatory variable in this test. As zero is a valid value in the range, it is not necessary to centre it on the mean.
A sample of about one percent of the original data set, with 3,843 out of 378,540 cases, will be used for the regression. The sampling step drew 3,844 craters, and one crater with a diameter of 200 km or more was then removed as an outlier.
Single-variable regressions were fitted first, using Depth and Layers separately to explain Diameter.
Depth explains about 55.6% of the variance in Diameter, with \(R{-}Square = 0.555726\). The p-value is well below the five percent significance level: \(\alpha = 0.05\) and \(p{-}value < 0.0001\).
\(Diameter = 1.75 + 21.4 * Depth\)
According to the regression model, every extra km in Depth is associated with an increase of 21.4 km in Diameter in the sample data.
This indicates a significant and positive association between Diameter and Depth.
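As a quick check, the R-Square and F value of the Depth model can be recomputed from the sums of squares that the GLM procedure reports for this model. The symbols \(SS\) and \(MS\) are shorthand introduced here for "Sum of Squares" and "Mean Square"; they are not SAS names.

\[ R^2 = \frac{SS_{Model}}{SS_{Corrected\ Total}} = \frac{93715.9556}{168637.1082} \approx 0.5557 \]

\[ F = \frac{MS_{Model}}{MS_{Error}} = \frac{93715.9556}{19.5056} \approx 4804.6 \]

Both values agree with the reported output up to rounding of the mean square.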
Layers explains about 3.5% of the variance in Diameter, with \(R{-}Square = 0.035153\). The p-value is well below the five percent significance level: \(\alpha = 0.05\) and \(p{-}value < 0.0001\).
\(Diameter = 3.13 + 3.99 * Layers\)
According to the regression model, every extra Layer is associated with an increase of 3.99 km in Diameter in the sample data.
This indicates a significant and positive association between Diameter and Layers.
When combined in a linear regression, Depth and Layers together explain about 57.9% of the variance in Diameter, with \(R{-}Square = 0.578792\). The p-value is well below the five percent significance level for both variables: \(\alpha = 0.05\) and \(p{-}value < 0.0001\).
\(Diameter = 1.84 + 23.5 * Depth - 3.59 * Layers\)
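This indicates a significant and positive association between Diameter and Depth, and a significant and negative association between Diameter and Layers, each with the other variable held constant. The sign of the Layers coefficient is reversed relative to the single-variable model.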
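The gain from adding Layers to a model that already contains Depth can be read from the Type I sum of squares for Layers in the GLM output. The notation \(\Delta R^2\) is introduced here only for illustration.

\[ \Delta R^2 = \frac{SS_{Layers \mid Depth}}{SS_{Corrected\ Total}} = \frac{3889.77049}{168637.1082} \approx 0.0231 \]

\[ 0.578792 - 0.555726 = 0.023066 \]

So Layers adds roughly 2.3 percentage points of explained variance on top of Depth, which is consistent with the small improvement noted in the summary.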
Using Depth with linear and quadratic terms explains about 56.4% of the variance in Diameter, with \(R{-}Square = 0.564372\). The p-value is well below the five percent significance level: \(\alpha = 0.05\) and \(p{-}value < 0.0001\).
\(Diameter = 1.89 + 16.57 * Depth + 3.91 * Depth^2\)
This indicates a significant and positive association between Diameter and Depth.
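To see how little the quadratic term changes the fit, the two Depth equations can be evaluated side by side. The following is a minimal sketch, not part of the original program: the data set name COMPARE, the variable names and the Depth grid (0 to 2.4 km, roughly the range of the sample) are assumptions made for illustration, and the coefficients are the rounded estimates quoted in the equations.

```sas
/* Illustrative sketch: compare predictions from the linear and quadratic
   Depth models using the rounded coefficients from the text.
   COMPARE, Linear, Quadratic and Difference are hypothetical names. */
DATA COMPARE;
   /* Integer loop index avoids floating-point drift in the step size */
   DO i = 0 TO 6;
      Depth      = i * 0.4;                              /* km, 0 to 2.4 */
      Linear     = 1.75 + 21.4 * Depth;
      Quadratic  = 1.89 + 16.57 * Depth + 3.91 * Depth**2;
      Difference = Quadratic - Linear;
      OUTPUT;
   END;
   DROP i;
RUN;

PROC PRINT DATA = COMPARE NOOBS;
   TITLE "Mars' Craters - Linear vs Quadratic Predicted Diameter";
RUN;

TITLE ;
```

The Difference column shows where the curve departs from the straight line; most sampled craters have a Depth close to zero (the sample mean is about 0.08 km), where the two models are nearly identical.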
The charts agree with the numbers in showing a dispersion of points that, although the association is statistically significant, does not conform adequately to a linear or quadratic regression.
The residuals are not randomly distributed and do not fit a normal distribution well, which can also be seen on the Q-Q plot.
A considerable number of points are influential enough to change the regression line or curve.
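The residual checks described above could be reproduced numerically along the following lines. This is a hedged sketch that assumes the RESULTS data set written by the first PROC GLM step of the program (OUTPUT OUT = RESULTS RESIDUAL = Residual STUDENT = Student) is still available; the cut-off of 3 for a large studentized residual is an illustrative assumption, not a value taken from the assignment.

```sas
/* Normality tests and a Q-Q plot for the residuals of Diameter by Depth.
   Assumes RESULTS, Residual and Student exist as created by PROC GLM. */
PROC UNIVARIATE DATA = RESULTS NORMAL;
   VAR Residual;
   QQPLOT Residual / NORMAL(MU = EST SIGMA = EST);
   TITLE "Mars' Craters - Residual Normality Check - Sample Data";
RUN;

/* Count observations whose studentized residual is large in absolute value.
   The threshold of 3 is an assumption chosen for illustration. */
PROC SQL;
   SELECT COUNT(*) AS Large_Residuals
   FROM RESULTS
   WHERE ABS(Student) > 3;
QUIT;

TITLE ;
```

A large count relative to the 3,843 observations would support the remark that many points pull on the fitted line or curve.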
The SURVEYSELECT Procedure
Selection Method | Simple Random Sampling |
---|---|
Input Data Set | WORK |
Random Number Seed | 196587 |
Sampling Rate | 0.01 |
Sample Size | 3844 |
Selection Probability | 0.010001 |
Sampling Weight | 99.98517 |
Output Data Set | EXTRACT |
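The selection probability and sampling weight are two views of the same ratio. Writing \(n\) for the sample size and \(N\) for the number of rows in the input data set (symbols introduced here), and assuming the procedure computes the weight as \(N / n\) for simple random sampling, the reported figures imply:

\[ N \approx n \times w = 3844 \times 99.98517 \approx 384{,}343 \]

\[ \pi = \frac{n}{N} = \frac{3844}{384343} \approx 0.010001, \qquad w = \frac{1}{\pi} \approx 99.985 \]

Under that assumption the input data set has about 384,343 rows, slightly more than the 378,540 craters quoted in the study description; the reason for the difference is not stated in this write-up.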
The SUMMARY Procedure
Variable | Label | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|---|
Diameter | Diameter | 3843 | 3.4079079 | 6.6251832 | 1.0000000 | 104.6900000 |
Depth | Depth | 3843 | 0.0774707 | 0.2307015 | -0.0300000 | 2.4100000 |
Depth2 | Depth Squared | 3843 | 0.0592110 | 0.3257892 | 0 | 5.8081000 |
Layers | Layers | 3843 | 0.0694770 | 0.3113545 | 0 | 3.0000000 |
Layers2 | Layers Squared | 3843 | 0.1017434 | 0.6000213 | 0 | 9.0000000 |
The GLM Procedure
Number of Observations Read | 3843 |
---|---|
Number of Observations Used | 3843 |
Dependent Variable: Diameter Diameter
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 1 | 93715.9556 | 93715.9556 | 4804.56 | <.0001 |
Error | 3841 | 74921.1526 | 19.5056 | ||
Corrected Total | 3842 | 168637.1082 |
R-Square | Coeff Var | Root MSE | Diameter Mean |
---|---|---|---|
0.555726 | 129.5962 | 4.416519 | 3.407908 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Depth | 1 | 93715.95561 | 93715.95561 | 4804.56 | <.0001 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Depth | 1 | 93715.95561 | 93715.95561 | 4804.56 | <.0001 |
Parameter | Estimate | Standard Error | t Value | Pr > \|t\| | 95% Confidence Limit (Lower) | 95% Confidence Limit (Upper) |
---|---|---|---|---|---|---|
Intercept | 1.74940837 | 0.07515404 | 23.28 | <.0001 | 1.60206272 | 1.89675401 |
Depth | 21.40808023 | 0.30885243 | 69.31 | <.0001 | 20.80254978 | 22.01361069 |
The GLM Procedure
Number of Observations Read | 3843 |
---|---|
Number of Observations Used | 3843 |
Dependent Variable: Diameter Diameter
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 1 | 5928.1150 | 5928.1150 | 139.94 | <.0001 |
Error | 3841 | 162708.9932 | 42.3611 | ||
Corrected Total | 3842 | 168637.1082 |
R-Square | Coeff Var | Root MSE | Diameter Mean |
---|---|---|---|
0.035153 | 190.9835 | 6.508541 | 3.407908 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Layers | 1 | 5928.115008 | 5928.115008 | 139.94 | <.0001 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Layers | 1 | 5928.115008 | 5928.115008 | 139.94 | <.0001 |
Parameter | Estimate | Standard Error | t Value | Pr > \|t\| | 95% Confidence Limit (Lower) | 95% Confidence Limit (Upper) |
---|---|---|---|---|---|---|
Intercept | 3.130725643 | 0.10757294 | 29.10 | <.0001 | 2.919820100 | 3.341631185 |
Layers | 3.989555635 | 0.33724836 | 11.83 | <.0001 | 3.328352649 | 4.650758621 |
The GLM Procedure
Number of Observations Read | 3843 |
---|---|
Number of Observations Used | 3843 |
Dependent Variable: Diameter Diameter
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 2 | 97605.7261 | 48802.8631 | 2638.31 | <.0001 |
Error | 3840 | 71031.3821 | 18.4978 | ||
Corrected Total | 3842 | 168637.1082 |
R-Square | Coeff Var | Root MSE | Diameter Mean |
---|---|---|---|
0.578792 | 126.2036 | 4.300902 | 3.407908 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Depth | 1 | 93715.95561 | 93715.95561 | 5066.34 | <.0001 |
Layers | 1 | 3889.77049 | 3889.77049 | 210.28 | <.0001 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Depth | 1 | 91677.61109 | 91677.61109 | 4956.15 | <.0001 |
Layers | 1 | 3889.77049 | 3889.77049 | 210.28 | <.0001 |
Parameter | Estimate | Standard Error | t Value | Pr > \|t\| | 95% Confidence Limit (Lower) | 95% Confidence Limit (Upper) |
---|---|---|---|---|---|---|
Intercept | 1.83554439 | 0.07342729 | 25.00 | <.0001 | 1.69158417 | 1.97950461 |
Depth | 23.51485901 | 0.33401830 | 70.40 | <.0001 | 22.85998877 | 24.16972925 |
Layers | -3.58895476 | 0.24749448 | -14.50 | <.0001 | -4.07418797 | -3.10372155 |
The GLM Procedure
Number of Observations Read | 3843 |
---|---|
Number of Observations Used | 3843 |
Dependent Variable: Diameter Diameter
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 2 | 95174.0328 | 47587.0164 | 2487.43 | <.0001 |
Error | 3840 | 73463.0754 | 19.1310 | ||
Corrected Total | 3842 | 168637.1082 |
R-Square | Coeff Var | Root MSE | Diameter Mean |
---|---|---|---|
0.564372 | 128.3456 | 4.373901 | 3.407908 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Depth | 1 | 93715.95561 | 93715.95561 | 4898.64 | <.0001 |
Depth2 | 1 | 1458.07719 | 1458.07719 | 76.22 | <.0001 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Depth | 1 | 13117.05568 | 13117.05568 | 685.64 | <.0001 |
Depth2 | 1 | 1458.07719 | 1458.07719 | 76.22 | <.0001 |
Parameter | Estimate | Standard Error | t Value | Pr > \|t\| | 95% Confidence Limit (Lower) | 95% Confidence Limit (Upper) |
---|---|---|---|---|---|---|
Intercept | 1.89246140 | 0.07621126 | 24.83 | <.0001 | 1.74304298 | 2.04187981 |
Depth | 16.57129011 | 0.63285917 | 26.18 | <.0001 | 15.33051783 | 17.81206238 |
Depth2 | 3.91238933 | 0.44814726 | 8.73 | <.0001 | 3.03375989 | 4.79101877 |
The REG Procedure
Model: MODEL1
Dependent Variable: Diameter Diameter
Number of Observations Read | 3843 |
---|---|
Number of Observations Used | 3843 |
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 1 | 93716 | 93716 | 4804.56 | <.0001 |
Error | 3841 | 74921 | 19.50564 | ||
Corrected Total | 3842 | 168637 |
Root MSE | 4.41652 | R-Square | 0.5557 |
---|---|---|---|
Dependent Mean | 3.40791 | Adj R-Sq | 0.5556 |
Coeff Var | 129.59619 |
Parameter Estimates
Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|---|
Intercept | Intercept | 1 | 1.74941 | 0.07515 | 23.28 | <.0001 |
Depth | Depth | 1 | 21.40808 | 0.30885 | 69.31 | <.0001 |
The REG Procedure
Model: MODEL2
Dependent Variable: Diameter Diameter
Number of Observations Read | 3843 |
---|---|
Number of Observations Used | 3843 |
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 3 | 98321 | 32774 | 1789.31 | <.0001 |
Error | 3839 | 70316 | 18.31631 | ||
Corrected Total | 3842 | 168637 |
Root MSE | 4.27976 | R-Square | 0.5830 |
---|---|---|---|
Dependent Mean | 3.40791 | Adj R-Sq | 0.5827 |
Coeff Var | 125.58308 |
Parameter Estimates
Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|---|
Intercept | Intercept | 1 | 1.93041 | 0.07463 | 25.87 | <.0001 |
Depth | Depth | 1 | 19.88768 | 0.66893 | 29.73 | <.0001 |
Depth2 | Depth Squared | 1 | 2.79146 | 0.44676 | 6.25 | <.0001 |
Layers | Layers | 1 | -3.28885 | 0.25092 | -13.11 | <.0001 |
Figure: Partial Regression Residual Plot (The REG Procedure, Model: MODEL2)
/* Using SAS Educational Virtual Machine running locally */
/* For CSV Files uploaded from MacOS */
FILENAME CSV "/folders/myfolders/marscrater_pds.csv"
         TERMSTR = CRLF;

PROC IMPORT DATAFILE = CSV
            OUT      = WORK
            DBMS     = CSV
            REPLACE;
RUN;

/* Unassign the file reference. */
FILENAME CSV;

/* Select a simple random sample of 1% of the population */
PROC SURVEYSELECT DATA     = WORK
                  OUT      = EXTRACT
                  METHOD   = SRS
                  SAMPRATE = 0.01
                  SEED     = 196587;
   /* Keep only the three variables used in the regression */
   ID DIAM_CIRCLE_IMAGE
      DEPTH_RIMFLOOR_TOPOG
      NUMBER_LAYERS;
RUN;

DATA SAMPLE;
   SET EXTRACT;
   /* Remove outliers */
   WHERE DIAM_CIRCLE_IMAGE < 200;
   /* No need to centre the explanatory variables as they
      already include a meaningful zero */

   Diameter = DIAM_CIRCLE_IMAGE;
   Depth    = DEPTH_RIMFLOOR_TOPOG;
   Layers   = NUMBER_LAYERS;

   /* Create squared versions to check for quadratic regression */
   Depth2  = Depth * Depth;
   Layers2 = Layers * Layers;

   LABEL Diameter = "Diameter"        /* Response - Quantitative */
         Depth    = "Depth"           /* Explanatory - Quantitative */
         Layers   = "Layers"          /* Explanatory - Quantitative */
         Depth2   = "Depth Squared"
         Layers2  = "Layers Squared";
RUN;

PROC SUMMARY DATA = SAMPLE PRINT;
   VAR Diameter
       Depth
       Depth2
       Layers
       Layers2;
   TITLE "Mars' Craters - Summary - Sample Data";
RUN;

/* Scatter plot with linear (red) and quadratic (green) fits */
PROC SGPLOT DATA = SAMPLE;
   REG X = Depth
       Y = Diameter
       / FILLEDOUTLINEDMARKERS
         MARKERATTRS = (COLOR = GREY SIZE = 5 SYMBOL = CIRCLEFILLED)
         MARKERFILLATTRS = (COLOR = GREY)
         MARKEROUTLINEATTRS = (COLOR = GREY THICKNESS = 1)
         LINEATTRS = (COLOR = RED THICKNESS = 2) DEGREE = 1 CLM;
   REG X = Depth
       Y = Diameter
       / LINEATTRS = (COLOR = GREEN THICKNESS = 2) DEGREE = 2 CLM;
   TITLE "Mars' Craters - Diameter by Depth - Sample Data";
   YAXIS LABEL = "Diameter";
   XAXIS LABEL = "Depth";
RUN;

/* Diameter by Depth, saving residuals for the diagnostics */
PROC GLM
   DATA  = SAMPLE
   PLOTS = DIAGNOSTICS;
   MODEL Diameter = Depth / CLPARM;
   OUTPUT OUT = RESULTS RESIDUAL = Residual STUDENT = Student;
   TITLE "Mars' Craters - GLM: Diameter by Depth - Sample Data";
RUN;

PROC GLM
   DATA  = SAMPLE
   PLOTS = NONE;
   MODEL Diameter = Layers / CLPARM;
   TITLE "Mars' Craters - GLM: Diameter by Layers - Sample Data";
RUN;

PROC GLM
   DATA  = SAMPLE
   PLOTS = NONE;
   MODEL Diameter = Depth Layers / CLPARM;
   TITLE "Mars' Craters - GLM: Diameter by Depth and Layers - Sample Data";
RUN;

PROC GLM
   DATA  = SAMPLE
   PLOTS = NONE;
   MODEL Diameter = Depth Depth2 / CLPARM;
   TITLE "Mars' Craters - GLM: Diameter by Depth and Depth Squared - Sample Data";
RUN;

/* MODEL1: Depth only; MODEL2: Depth, Depth squared and Layers */
PROC REG
   DATA  = SAMPLE
   PLOTS = PARTIAL;
   MODEL Diameter = Depth;
   MODEL Diameter = Depth Depth2 Layers / PARTIAL;
   TITLE "Mars' Craters - Regression - Sample Data";
RUN;

/* Clear the title */
TITLE ;