Sample Selection Method
Selection Method | Simple Random Sampling |
---|---|
Specialisation | Data Analysis and Interpretation |
Course | Regression Modeling in Practice |
Education Institution | Wesleyan University |
Publisher | Coursera |
Assignment | Test a Logistic Regression Model |
Besides the historical fascination with Mars due to its proximity and some similarities with Earth, the collection of facts allows scientists to put together a big jigsaw puzzle.
NASA's recent announcement of evidence of flowing liquid water on the surface of Mars adds to what is already known and to the curiosity, or even the need, to find out much more.
When the association of Hemisphere with Diameter and Depth is tested using logistic regression, the p-values are above the alpha value of 0.05, which indicates no statistically significant association between the variables when they are coded as binary variables.
With all the talk about Mars, in both science and fiction (Ridley Scott's movie The Martian, based on Andy Weir's book of the same name, was released in early October 2015), the data set chosen for this assignment is the Mars Craters.
The Mars Craters Study presents a global database that includes 378,540 Mars craters, with a diameter of 1 km or larger, that were created between 4.2 and 3.8 billion years ago during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).
The data set was made available by Wesleyan University/Coursera as part of the Data Analysis and Interpretation Specialisation, from the Ph.D. Thesis Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database (2011) by Robbins, S.J., University of Colorado at Boulder.
The data set provides a catalogue of craters on Mars. The initial idea is to check for patterns that could identify specific major events that might have happened and that would have had a significant impact on Mars' geology, climate and life as a planetary body.
As the initial data set has only nine variables, they could all be relevant for formulating hypotheses and for leading to a conclusion, so all the variables will be kept for this assignment.
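The response variable, North_Hemisphere, has a value of 0 for craters with a negative latitude and a value of 1 otherwise. The explanatory variable DIAM_CIRCLE_IMAGE is binned by its median, with a value of 0 for diameters less than 1.53 and a value of 1 otherwise. DEPTH_RIMFLOOR_TOPOG is binned the same way, with a value of 0 for depths at or below its median of 0 and a value of 1 otherwise.
A sample of about one percent of the original data set, with 3,844 out of 378,540 cases, will be used for the regression.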
The point estimate for the odds ratio is 1.027, but its 95% confidence interval of (0.902, 1.169) includes 1 and the p-value of 0.6909 is above the alpha value of 0.05.
As per the logistic regression model, with binary categorical variables, there is no statistically significant association between Hemisphere and Diameter.
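As a cross-check, the odds ratio and its Wald limits can be recovered from the Big_Diameter row of the Analysis of Maximum Likelihood Estimates table. This is a sketch that assumes the usual 95% normal quantile of 1.96; the symbols $\hat\beta$ (the estimate, 0.0263) and $\mathrm{SE}$ (its standard error, 0.0661) are notation introduced here, not part of the SAS output.

```latex
\begin{align*}
\mathrm{OR} &= e^{\hat\beta} = e^{0.0263} \approx 1.027 \\
\text{lower limit} &= e^{\hat\beta - 1.96\,\mathrm{SE}} = e^{0.0263 - 1.96 \times 0.0661} \approx 0.902 \\
\text{upper limit} &= e^{\hat\beta + 1.96\,\mathrm{SE}} = e^{0.0263 + 1.96 \times 0.0661} \approx 1.169
\end{align*}
```

Since the interval contains 1, the data are compatible with equal odds for small and big craters, which agrees with the p-value.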
With both explanatory variables in the model, the point estimate for the odds ratio of diameter is 0.983, with a 95% confidence interval of (0.847, 1.140) and a p-value of 0.8205, which is above the alpha value of 0.05. Likewise, the point estimate for the odds ratio of depth is 1.118, with a 95% confidence interval of (0.927, 1.347) and a p-value of 0.2426, which is also above the alpha value of 0.05. Both intervals include 1.
As per the logistic regression model, with binary categorical variables, there is no statistically significant association between Hemisphere and either Diameter or Depth.
The charts for both analyses show an ROC curve that lies almost on top of the diagonal chance line, with an area under the curve of 0.5033 for Diameter alone and 0.5097 for Diameter and Depth. Both values are barely above the 0.5 expected from random guessing.
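The area under the ROC curve is the c statistic reported in the association tables, and under the standard definitions it is tied to Somers' D by $D = 2c - 1$. A quick check against the reported values ($c$ and $D$ are shorthand introduced here):

```latex
\begin{align*}
D &= 2c - 1 \\
\text{Diameter:}\quad D &= 2(0.5033) - 1 = 0.0066 \approx 0.007 \\
\text{Diameter and Depth:}\quad D &= 2(0.5097) - 1 = 0.0194 \approx 0.019
\end{align*}
```

Both results match the Somers' D values in the output. A D this close to 0 means the models rank craters by hemisphere hardly better than chance.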
The SURVEYSELECT Procedure
Selection Method | Simple Random Sampling |
---|---|
Input Data Set | WORK |
Random Number Seed | 196587 |
Sampling Rate | 0.01 |
Sample Size | 3844 |
Selection Probability | 0.010001 |
Sampling Weight | 99.98517 |
Output Data Set | EXTRACT |
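For simple random sampling, the sampling weight is the reciprocal of the selection probability, so the size of the input data set can be inferred from the figures above. The symbols $w$ (weight), $\pi$ (selection probability), $n$ (sample size) and $N$ (rows in the input data set) are notation introduced here. The value of $N$ is an inference from the printed output, not a figure reported by SAS.

```latex
\begin{align*}
w &= \frac{1}{\pi} = \frac{N}{n} \\
N &= n \times w = 3844 \times 99.98517 \approx 384{,}343 \\
\pi &= \frac{n}{N} \approx \frac{3844}{384{,}343} \approx 0.010001
\end{align*}
```

The implied input size of about 384,343 rows is slightly larger than the 378,540 craters quoted in the study description, so the imported file may hold more records than that description states.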
The SUMMARY Procedure
Variable | Minimum | Mean | Median | Maximum |
---|---|---|---|---|
DIAM_CIRCLE_IMAGE | 1.0000000 | 3.4845239 | 1.5300000 | 297.9200000 |
DEPTH_RIMFLOOR_TOPOG | -0.0300000 | 0.0777940 | 0 | 2.4100000 |
The SUMMARY Procedure
North Hemisphere | N Obs |
---|---|
0 | 2343 |
1 | 1501 |
The SUMMARY Procedure
Big Diameter | N Obs |
---|---|
0 | 1913 |
1 | 1931 |
The SUMMARY Procedure
Big Depth | N Obs |
---|---|
0 | 3074 |
1 | 770 |
The LOGISTIC Procedure
Model Information | ||
---|---|---|
Data Set | WORK.SAMPLE | |
Response Variable | North_Hemisphere | North Hemisphere |
Number of Response Levels | 2 | |
Model | binary logit | |
Optimization Technique | Fisher's scoring | |

Number of Observations Read | 3844 |
---|---|
Number of Observations Used | 3844 |
Response Profile | ||
---|---|---|
Ordered Value | North_Hemisphere | Total Frequency |
1 | 0 | 2343 |
2 | 1 | 1501 |
Probability modeled is North_Hemisphere=0.
Model Convergence Status |
---|
Convergence criterion (GCONV=1E-8) satisfied. |
Model Fit Statistics | ||
---|---|---|
Criterion | Intercept Only | Intercept and Covariates |
AIC | 5144.978 | 5146.820 |
SC | 5151.232 | 5159.328 |
-2 Log L | 5142.978 | 5142.820 |
Testing Global Null Hypothesis: BETA=0 | |||
---|---|---|---|
Test | Chi-Square | DF | Pr > ChiSq |
Likelihood Ratio | 0.1582 | 1 | 0.6909 |
Score | 0.1582 | 1 | 0.6909 |
Wald | 0.1581 | 1 | 0.6909 |
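The fit statistics and the likelihood ratio test above are linked by simple arithmetic. The sketch below assumes the standard definitions, where AIC adds 2 per estimated parameter and SC adds $\ln n$ per parameter. Here $k = 2$ is the number of parameters (intercept and Big_Diameter) and $n = 3844$ is the number of observations; both symbols are introduced here.

```latex
\begin{align*}
\chi^2_{LR} &= 5142.978 - 5142.820 = 0.158 \\
\mathrm{AIC} &= -2\log L + 2k = 5142.820 + 2(2) = 5146.820 \\
\mathrm{SC} &= -2\log L + k \ln n = 5142.820 + 2\ln(3844) \approx 5159.33
\end{align*}
```

Both AIC and SC are higher than their intercept-only values (5144.978 and 5151.232), so adding Big_Diameter does not improve the fit enough to offset the extra parameter.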
Analysis of Maximum Likelihood Estimates | |||||
---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
Intercept | 1 | 0.4321 | 0.0468 | 85.2492 | <.0001 |
Big_Diameter | 1 | 0.0263 | 0.0661 | 0.1581 | 0.6909 |
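The estimates can be turned into a Wald statistic and into predicted probabilities with the standard logistic formulas. This is a sketch using the rounded values printed above, so the last digit can differ from the SAS output. The modeled event is North_Hemisphere = 0, as stated in the Response Profile.

```latex
\begin{align*}
\text{Wald } \chi^2 &= \left(\frac{0.0263}{0.0661}\right)^2 \approx 0.158 \\
P(\text{North\_Hemisphere} = 0 \mid \text{Big\_Diameter} = 0) &= \frac{1}{1 + e^{-0.4321}} \approx 0.606 \\
P(\text{North\_Hemisphere} = 0 \mid \text{Big\_Diameter} = 1) &= \frac{1}{1 + e^{-(0.4321 + 0.0263)}} \approx 0.613
\end{align*}
```

The two predicted probabilities differ by less than one percentage point, and both sit close to the overall share of southern craters in the sample, 2343 / 3844 ≈ 0.610.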
Odds Ratio Estimates | |||
---|---|---|---|
Effect | Point Estimate | 95% Wald Confidence Limits | |
Big_Diameter | 1.027 | 0.902 | 1.169 |
Association of Predicted Probabilities and Observed Responses | |||
---|---|---|---|
Percent Concordant | 25.3 | Somers' D | 0.007 |
Percent Discordant | 24.7 | Gamma | 0.013 |
Percent Tied | 50.0 | Tau-a | 0.003 |
Pairs | 3516843 | c | 0.503 |
The LOGISTIC Procedure
Model Information | ||
---|---|---|
Data Set | WORK.SAMPLE | |
Response Variable | North_Hemisphere | North Hemisphere |
Number of Response Levels | 2 | |
Model | binary logit | |
Optimization Technique | Fisher's scoring | |

Number of Observations Read | 3844 |
---|---|
Number of Observations Used | 3844 |
Response Profile | ||
---|---|---|
Ordered Value | North_Hemisphere | Total Frequency |
1 | 0 | 2343 |
2 | 1 | 1501 |
Probability modeled is North_Hemisphere=0.
Model Convergence Status |
---|
Convergence criterion (GCONV=1E-8) satisfied. |
Model Fit Statistics | ||
---|---|---|
Criterion | Intercept Only | Intercept and Covariates |
AIC | 5144.978 | 5147.451 |
SC | 5151.232 | 5166.214 |
-2 Log L | 5142.978 | 5141.451 |
Testing Global Null Hypothesis: BETA=0 | |||
---|---|---|---|
Test | Chi-Square | DF | Pr > ChiSq |
Likelihood Ratio | 1.5265 | 2 | 0.4662 |
Score | 1.5199 | 2 | 0.4677 |
Wald | 1.5192 | 2 | 0.4679 |
Analysis of Maximum Likelihood Estimates | |||||
---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
Intercept | 1 | 0.4318 | 0.0468 | 85.1401 | <.0001 |
Big_Diameter | 1 | -0.0172 | 0.0757 | 0.0515 | 0.8205 |
Big_Depth | 1 | 0.1114 | 0.0953 | 1.3652 | 0.2426 |
Odds Ratio Estimates | |||
---|---|---|---|
Effect | Point Estimate | 95% Wald Confidence Limits | |
Big_Diameter | 0.983 | 0.847 | 1.140 |
Big_Depth | 1.118 | 0.927 | 1.347 |
Association of Predicted Probabilities and Observed Responses | |||
---|---|---|---|
Percent Concordant | 32.0 | Somers' D | 0.019 |
Percent Discordant | 30.1 | Gamma | 0.031 |
Percent Tied | 37.8 | Tau-a | 0.009 |
Pairs | 3516843 | c | 0.510 |
/* Using SAS Educational Virtual Machine running locally */
/* For CSV Files uploaded from MacOS */
FILENAME CSV "/folders/myfolders/marscrater_pds.csv"
    TERMSTR = CRLF;

PROC IMPORT DATAFILE = CSV
    OUT = WORK
    DBMS = CSV
    REPLACE;
RUN;

/* Unassign the file reference. */
FILENAME CSV;

/* Select a simple random sample of 1% of the population */
PROC SURVEYSELECT DATA = WORK
    OUT = EXTRACT
    METHOD = SRS
    SAMPRATE = 0.01
    SEED = 196587;
    ID LATITUDE_CIRCLE_IMAGE
       DIAM_CIRCLE_IMAGE
       DEPTH_RIMFLOOR_TOPOG;
RUN;

/* Summary statistics; the medians are used as cut points below */
PROC SUMMARY DATA = EXTRACT MIN MEAN MEDIAN MAX PRINT ;
    VAR DIAM_CIRCLE_IMAGE
        DEPTH_RIMFLOOR_TOPOG;
    TITLE "Mars' Craters - Summary - Sample Data";
RUN;

/* Bin the response and explanatory variables into binary variables */
DATA SAMPLE;
    SET EXTRACT;

    IF LATITUDE_CIRCLE_IMAGE < 0
        THEN North_Hemisphere = 0;
        ELSE North_Hemisphere = 1;

    IF DIAM_CIRCLE_IMAGE < 1.53 /* Median */
        THEN Big_Diameter = 0;
        ELSE Big_Diameter = 1;

    IF DEPTH_RIMFLOOR_TOPOG <= 0 /* Median */
        THEN Big_Depth = 0;
        ELSE Big_Depth = 1;

    LABEL North_Hemisphere = "North Hemisphere" /* Response - Categorical */
          Big_Diameter = "Big Diameter" /* Explanatory - Bin into Categorical */
          Big_Depth = "Big Depth"; /* Explanatory - Bin into Categorical */
RUN;

/* Frequency of each level of the binary variables */
PROC SUMMARY DATA = SAMPLE PRINT ;
    CLASS North_Hemisphere;
    TITLE "Mars' Craters - Summary: North Hemisphere - Sample Data";
RUN;

PROC SUMMARY DATA = SAMPLE PRINT ;
    CLASS Big_Diameter;
    TITLE "Mars' Craters - Summary: Big Diameter - Sample Data";
RUN;

PROC SUMMARY DATA = SAMPLE PRINT ;
    CLASS Big_Depth;
    TITLE "Mars' Craters - Summary: Big Depth - Sample Data";
RUN;

/* As the output shows, the probability modeled is North_Hemisphere = 0 */
PROC LOGISTIC
    DATA = SAMPLE
    PLOTS = ROC;
    MODEL North_Hemisphere = Big_Diameter;
    TITLE "Mars' Craters - Logistic - Hemisphere by Diameter - Sample Data";
RUN;

PROC LOGISTIC
    DATA = SAMPLE
    PLOTS = ROC;
    MODEL North_Hemisphere = Big_Diameter Big_Depth;
    TITLE "Mars' Craters - Logistic - Hemisphere by Diameter and Depth - Sample Data";
RUN;

/* Clear the title */
TITLE ;