Summary

Specialisation Data Analysis and Interpretation
Course Regression Modeling in Practice
Education Institution Wesleyan University
Publisher Coursera
Assignment Test a Logistic Regression Model

Introduction

Beyond the long-standing fascination with Mars, owing to its proximity and its similarities to Earth, every newly collected fact helps scientists piece together a big jigsaw puzzle.

NASA's recent announcement of evidence of liquid water flowing on the surface of Mars adds to what is already known and to the curiosity, or even the need, to find out much more.

Summary

When testing the association of Hemisphere with Diameter and Depth using logistic regression, all p-values are above the alpha value of 0.05, which indicates that there is no statistically significant association between the variables when they are coded as binary variables.

The Data Set

With all the talk about Mars in both science and fiction (Ridley Scott's film The Martian, based on Andy Weir's book of the same name, was released in early October 2015), the data set chosen for this assignment is the Mars Craters data set.

The Mars Craters study presents a global database of 378,540 Mars craters with a diameter of 1 km or larger, created between 4.2 and 3.8 billion years ago during a period of heavy bombardment (i.e. impacts of asteroids, proto-planets, and comets).

The data set was made available by Wesleyan University/Coursera as part of the Data Analysis and Interpretation Specialisation. It comes from the Ph.D. thesis "Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database" (2011) by Robbins, S.J., University of Colorado at Boulder.

Topic of Interest

The data set provides a catalogue of craters on Mars. The initial idea is to look for patterns that could identify specific major events that might have had a significant impact on Mars' geology, climate and life as a planetary body.

Codebook

As the initial data set has only nine variables, they could all be relevant to formulating hypotheses and reaching a conclusion, so all the variables will be kept for this assignment.

Variables

  • CRATER_ID: crater ID for internal use, based upon the region of the planet (\({1 \over 16}\)), the “pass” under which the crater was identified, and the order in which it was identified
  • LATITUDE_CIRCLE_IMAGE: latitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees North)
  • LONGITUDE_CIRCLE_IMAGE: longitude from the derived centre of a non-linear least-squares circle fit to the vertices selected to manually identify the crater rim (units are decimal degrees East)
  • DIAM_CIRCLE_IMAGE: diameter from a non-linear least squares circle fit to the vertices selected to manually identify the crater rim (units are km)
  • DEPTH_RIMFLOOR_TOPOG: average elevation of each of the manually determined N points along (or inside) the crater rim (units are km)
    • Depth Rim: Points are selected as relative topographic highs, under the assumption that they are the least eroded and therefore the most original points along the rim
    • Depth Floor: Points were chosen as the lowest elevation that did not include visible embedded craters
  • MORPHOLOGY_EJECTA_1: ejecta morphology classified.
    • If there are multiple values, separated by a “/”, then the order is the inner-most ejecta through the outer-most, or the top-most through the bottom-most
  • MORPHOLOGY_EJECTA_2: the morphology of the layer(s) itself/themselves. This classification system is unique to this work.
  • MORPHOLOGY_EJECTA_3: overall texture and/or shape of some of the layer(s)/ejecta that are generally unique and deserve separate morphological classification.
  • NUMBER_LAYERS: the maximum number of cohesive layers in any azimuthal direction that could be reliably identified

Test a Logistic Regression Model

Variables for the analysis

  • North_Hemisphere is a categorical variable that has value 0 for the South hemisphere and value 1 for the North hemisphere. This is the binary categorical response variable.
  • Big_Diameter is a categorical variable that splits DIAM_CIRCLE_IMAGE by its median, with value of 0 for diameters less than 1.53 and value of 1 otherwise.
  • Big_Depth is a categorical variable with value of 0 for depth less than or equal to zero and value of 1 otherwise.
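The three cut-off rules above can be sketched outside SAS as follows. This is only an illustrative Python sketch: the `Crater` class, the `binarise` function and the example values are hypothetical names and numbers introduced here, while the thresholds follow the variable definitions above (latitude below zero is South, 1.53 km is the sample median diameter, and depth at or below zero counts as not deep).

```python
from dataclasses import dataclass

# Sample median of DIAM_CIRCLE_IMAGE, as reported in the summary statistics
DIAMETER_MEDIAN_KM = 1.53


@dataclass
class Crater:
    """Hypothetical container for the three source columns used here."""
    latitude: float     # LATITUDE_CIRCLE_IMAGE, decimal degrees North
    diameter_km: float  # DIAM_CIRCLE_IMAGE
    depth_km: float     # DEPTH_RIMFLOOR_TOPOG


def binarise(crater: Crater) -> dict:
    """Return the three 0/1 indicators used in the regression."""
    return {
        "North_Hemisphere": 0 if crater.latitude < 0 else 1,
        "Big_Diameter": 0 if crater.diameter_km < DIAMETER_MEDIAN_KM else 1,
        "Big_Depth": 0 if crater.depth_km <= 0 else 1,
    }


# Made-up crater: southern latitude, above-median diameter, zero depth
example = Crater(latitude=-12.5, diameter_km=2.10, depth_km=0.0)
indicators = binarise(example)
print(indicators)
```

Note that a crater with a latitude of exactly zero is coded as North and a diameter exactly equal to the median is coded as big, because both rules use a strict "less than" comparison for the value 0.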

Sample

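A simple random sample of about one percent of the original data set, 3,844 cases, will be used for the regression. The sampling weight of 99.98517 reported by SAS implies 384,343 rows in the imported file, rather than the 378,540 craters quoted in the study description.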

Regression

Diameter

The point estimate for the odds ratio is 1.027, with a 95% confidence interval of (0.902, 1.169) that includes 1, and the p-value of 0.6909 is above the alpha value of 0.05.

As per the logistic regression model, with binary categorical variables, there is no statistically significant association between Hemisphere and Diameter.
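The odds ratio, its Wald confidence limits and the p-value can all be recomputed from the estimate (0.0263) and standard error (0.0661) listed for Big_Diameter in the parameter estimates table of the SAS output. The sketch below is a cross-check in Python rather than part of the original SAS analysis; the function name `wald_summary` is made up for illustration, and small differences in the last digit are expected because the inputs are already rounded to four decimals.

```python
import math


def wald_summary(estimate: float, std_error: float, z: float = 1.959964):
    """Odds ratio, 95% Wald limits, Wald chi-square and p-value (1 DF)."""
    odds_ratio = math.exp(estimate)
    lower = math.exp(estimate - z * std_error)
    upper = math.exp(estimate + z * std_error)
    chi_square = (estimate / std_error) ** 2
    # Upper tail of a chi-square with 1 DF: P(Z^2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return odds_ratio, lower, upper, chi_square, p_value


# Estimate and standard error for Big_Diameter in the one-predictor model
odds_ratio, lower, upper, chi_square, p_value = wald_summary(0.0263, 0.0661)
print(f"OR={odds_ratio:.3f} CI=({lower:.3f}, {upper:.3f}) "
      f"chi2={chi_square:.4f} p={p_value:.4f}")
```

Because the confidence interval straddles 1, the data are compatible with big craters being either slightly more or slightly less likely to fall in a given hemisphere, which is the same conclusion the p-value gives.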

Diameter and Depth

With both explanatory variables in the model, the point estimate for the odds ratio of diameter is 0.983, with a 95% confidence interval of (0.847, 1.140) and a p-value of 0.8205. For depth, the point estimate is 1.118, with a 95% confidence interval of (0.927, 1.347) and a p-value of 0.2426. Both confidence intervals include 1 and both p-values are above the alpha value of 0.05.

As per the logistic regression model, with binary categorical variables, there is no statistically significant association of Hemisphere with Diameter or Depth.
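One way to see how weak the fitted relationship is: convert the coefficients of the two-predictor model into predicted probabilities for each combination of the explanatory variables. The Python sketch below uses the estimates from the SAS output (intercept 0.4318, Big_Diameter -0.0172, Big_Depth 0.1114). Since the output states that the probability modelled is North_Hemisphere=0, the results are probabilities of a crater being in the South hemisphere. The function name `prob_south` is made up for illustration.

```python
import math
from itertools import product

# Maximum likelihood estimates from the two-predictor SAS output
INTERCEPT = 0.4318
B_DIAMETER = -0.0172
B_DEPTH = 0.1114


def prob_south(big_diameter: int, big_depth: int) -> float:
    """Predicted P(North_Hemisphere = 0) via the inverse logit."""
    logit = INTERCEPT + B_DIAMETER * big_diameter + B_DEPTH * big_depth
    return 1.0 / (1.0 + math.exp(-logit))


# All four combinations of the two binary explanatory variables
fitted = {(d, h): prob_south(d, h) for d, h in product((0, 1), repeat=2)}
for (d, h), p in fitted.items():
    print(f"Big_Diameter={d} Big_Depth={h} -> P(South)={p:.4f}")

# Share of South craters in the sample, from the response profile
baseline = 2343 / 3844
print(f"Sample share of South craters: {baseline:.4f}")
```

All four predicted probabilities stay within a few percentage points of the overall share of South craters in the sample, so knowing the diameter and depth categories barely changes the prediction.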

Charts

The charts for both analyses show a ROC curve that lies almost on top of the diagonal chance line, with an Area Under the Curve of 0.5033 for Diameter and 0.5097 for Diameter and Depth, barely above the 0.5 expected from random guessing.
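The Area Under the Curve is the same quantity as the c statistic in the association tables of the SAS output, and it can be approximated from the concordance percentages listed there: concordant pairs count fully and tied pairs count half. The Python sketch below is a cross-check under that standard definition; the helper names are made up, and because the percentages are rounded to one decimal the result can differ from the reported c in the third decimal.

```python
def c_statistic(pct_concordant: float, pct_tied: float) -> float:
    """c (AUC) from percentages: concordant pairs plus half of the ties."""
    return (pct_concordant + 0.5 * pct_tied) / 100.0


def somers_d_from_auc(auc: float) -> float:
    """Somers' D and the c statistic are linked by D = 2c - 1."""
    return 2.0 * auc - 1.0


# Pairs = South craters x North craters, from the response profile
pairs = 2343 * 1501

c_diameter = c_statistic(25.3, 50.0)  # Diameter only
c_both = c_statistic(32.0, 37.8)      # Diameter and Depth

d_diameter = somers_d_from_auc(0.5033)
d_both = somers_d_from_auc(0.5097)

print(pairs, round(c_diameter, 3), round(c_both, 3))
print(round(d_diameter, 3), round(d_both, 3))
```

The large share of tied pairs, half of all pairs in the Diameter model, reflects the fact that a single binary predictor can only produce two distinct predicted probabilities.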

Using SAS

SAS Output

Results: W04-Test a Logistic Regression Model-Local.sas

The SURVEYSELECT Procedure

Sample Selection Method

Selection Method Simple Random Sampling

Sample Selection Summary

Input Data Set WORK
Random Number Seed 196587
Sampling Rate 0.01
Sample Size 3844
Selection Probability 0.010001
Sampling Weight 99.98517
Output Data Set EXTRACT
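The figures in the sample selection summary are linked by simple arithmetic, which makes it possible to recover the number of rows in the input data set. The Python sketch below is a cross-check, not part of the SAS run. It assumes that the sampling weight is the inverse of the selection probability and that the sample size is the sampling rate times the population, rounded up; the second assumption is believed to be the default SURVEYSELECT behaviour but is not confirmed by the output itself.

```python
import math

# Values copied from the sample selection summary
sample_size = 3844
sampling_weight = 99.98517
sampling_rate = 0.01

# Weight = population / sample size, so the population is their product
implied_population = round(sample_size * sampling_weight)

# Selection probability is the inverse of the sampling weight
selection_probability = sample_size / implied_population

# Assumed rule: sample size = rate x population, rounded up
rounded_up_size = math.ceil(sampling_rate * implied_population)

print(implied_population, round(selection_probability, 6), rounded_up_size)
```

The implied population is 384,343 rows, and both the selection probability of 0.010001 and the sample size of 3,844 are reproduced from it.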

Mars' Craters - Summary - Sample Data

The SUMMARY Procedure

Summary statistics

Variable Minimum Mean Median Maximum
DIAM_CIRCLE_IMAGE 1.0000000 3.4845239 1.5300000 297.9200000
DEPTH_RIMFLOOR_TOPOG -0.0300000 0.0777940 0 2.4100000

Mars' Craters - Summary: North Hemisphere - Sample Data

The SUMMARY Procedure

Summary statistics

North Hemisphere N Obs
0 2343
1 1501

Mars' Craters - Summary: Big Diameter - Sample Data

The SUMMARY Procedure

Summary statistics

Big Diameter N Obs
0 1913
1 1931

Mars' Craters - Summary: Big Depth - Sample Data

The SUMMARY Procedure

Summary statistics

Big Depth N Obs
0 3074
1 770

Mars' Craters - Logistic - Hemisphere by Diameter - Sample Data

The LOGISTIC Procedure

Model Information
Data Set WORK.SAMPLE  
Response Variable North_Hemisphere North Hemisphere
Number of Response Levels 2  
Model binary logit  
Optimization Technique Fisher's scoring  

Observations Summary

Number of Observations Read 3844
Number of Observations Used 3844

Response Profile

Ordered Value North_Hemisphere Total Frequency
1 0 2343
2 1 1501

Probability modeled is North_Hemisphere=0.

Convergence Status

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Fit Statistics

Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 5144.978 5146.820
SC 5151.232 5159.328
-2 Log L 5142.978 5142.820

Global Tests

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 0.1582 1 0.6909
Score 0.1582 1 0.6909
Wald 0.1581 1 0.6909
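The fit statistics and the likelihood ratio test are tied together by standard formulas, so the values above can be cross-checked from the two -2 Log L figures. The Python sketch below assumes the usual definitions: the likelihood ratio chi-square is the drop in -2 Log L, AIC adds 2 per parameter, and SC adds ln(n) per parameter, with n being the number of observations used. The variable names are made up for illustration.

```python
import math

n_obs = 3844                # Number of Observations Used
neg2logl_null = 5142.978    # -2 Log L, Intercept Only
neg2logl_model = 5142.820   # -2 Log L, Intercept and Covariates
n_params = 2                # intercept + Big_Diameter

# Likelihood ratio chi-square: improvement over the intercept-only model
lr_chi_square = neg2logl_null - neg2logl_model

# Upper tail of a chi-square with 1 DF: erfc(sqrt(x / 2))
lr_p_value = math.erfc(math.sqrt(lr_chi_square / 2))

# Penalised fit criteria for the model with covariates
aic = neg2logl_model + 2 * n_params
sc = neg2logl_model + n_params * math.log(n_obs)

print(f"LR={lr_chi_square:.3f} p={lr_p_value:.4f} AIC={aic:.3f} SC={sc:.3f}")
```

Both AIC and SC are higher for the model with Big_Diameter than for the intercept-only model, which is consistent with the predictor adding almost no explanatory power.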

Parameter Estimates

Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 0.4321 0.0468 85.2492 <.0001
Big_Diameter 1 0.0263 0.0661 0.1581 0.6909

Odds Ratios

Odds Ratio Estimates
Effect Point Estimate 95% Wald Confidence Limits
Big_Diameter 1.027 0.902 1.169

Association Statistics

Association of Predicted Probabilities and Observed Responses
Percent Concordant 25.3 Somers' D 0.007
Percent Discordant 24.7 Gamma 0.013
Percent Tied 50.0 Tau-a 0.003
Pairs 3516843 c 0.503

ROC Curve

ROC Curve for Model

Mars' Craters - Logistic - Hemisphere by Diameter and Depth - Sample Data

The LOGISTIC Procedure

Model Information
Data Set WORK.SAMPLE  
Response Variable North_Hemisphere North Hemisphere
Number of Response Levels 2  
Model binary logit  
Optimization Technique Fisher's scoring  

Observations Summary

Number of Observations Read 3844
Number of Observations Used 3844

Response Profile

Ordered Value North_Hemisphere Total Frequency
1 0 2343
2 1 1501

Probability modeled is North_Hemisphere=0.

Convergence Status

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Fit Statistics

Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 5144.978 5147.451
SC 5151.232 5166.214
-2 Log L 5142.978 5141.451

Global Tests

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 1.5265 2 0.4662
Score 1.5199 2 0.4677
Wald 1.5192 2 0.4679

Parameter Estimates

Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 0.4318 0.0468 85.1401 <.0001
Big_Diameter 1 -0.0172 0.0757 0.0515 0.8205
Big_Depth 1 0.1114 0.0953 1.3652 0.2426

Odds Ratios

Odds Ratio Estimates
Effect Point Estimate 95% Wald Confidence Limits
Big_Diameter 0.983 0.847 1.140
Big_Depth 1.118 0.927 1.347

Association Statistics

Association of Predicted Probabilities and Observed Responses
Percent Concordant 32.0 Somers' D 0.019
Percent Discordant 30.1 Gamma 0.031
Percent Tied 37.8 Tau-a 0.009
Pairs 3516843 c 0.510

ROC Curve

ROC Curve for Model

SAS Code

/* Using SAS Educational Virtual Machine running locally */
/* For CSV Files uploaded from MacOS */
FILENAME CSV     "/folders/myfolders/marscrater_pds.csv"
         TERMSTR = CRLF;

PROC IMPORT DATAFILE = CSV
            OUT      = WORK
            DBMS     = CSV
            REPLACE;
RUN;

/* Unassign the file reference. */
FILENAME CSV;

/* Select a simple random sample of 1% of the population */
PROC SURVEYSELECT DATA     = WORK
                  OUT      = EXTRACT
                  METHOD   = SRS
                  SAMPRATE = 0.01
                  SEED     = 196587;
  ID LATITUDE_CIRCLE_IMAGE
     DIAM_CIRCLE_IMAGE
     DEPTH_RIMFLOOR_TOPOG;
RUN;

/* Summary statistics used to choose the cut-off points below */
PROC SUMMARY DATA = EXTRACT MIN MEAN MEDIAN MAX PRINT ;
  VAR   DIAM_CIRCLE_IMAGE
        DEPTH_RIMFLOOR_TOPOG;
  TITLE "Mars' Craters - Summary - Sample Data";
RUN;

/* Bin the response and explanatory variables into binary categories */
DATA SAMPLE;
  SET EXTRACT;

  IF LATITUDE_CIRCLE_IMAGE < 0
    THEN North_Hemisphere = 0;
    ELSE North_Hemisphere = 1;

  IF DIAM_CIRCLE_IMAGE < 1.53 /* Median */
    THEN Big_Diameter = 0;
    ELSE Big_Diameter = 1;

  IF DEPTH_RIMFLOOR_TOPOG <= 0 /* Median */
    THEN Big_Depth = 0;
    ELSE Big_Depth = 1;

  LABEL North_Hemisphere = "North Hemisphere" /* Response    - Categorical */
        Big_Diameter     = "Big Diameter"     /* Explanatory - Bin into Categorical */
        Big_Depth        = "Big Depth";       /* Explanatory - Bin into Categorical */
RUN;

/* Number of observations in each category of the binary variables */
PROC SUMMARY DATA = SAMPLE PRINT ;
  CLASS North_Hemisphere;
  TITLE "Mars' Craters - Summary: North Hemisphere - Sample Data";
RUN;

PROC SUMMARY DATA = SAMPLE PRINT ;
  CLASS Big_Diameter;
  TITLE "Mars' Craters - Summary: Big Diameter - Sample Data";
RUN;

PROC SUMMARY DATA = SAMPLE PRINT ;
  CLASS Big_Depth;
  TITLE "Mars' Craters - Summary: Big Depth - Sample Data";
RUN;

/* Model 1: Hemisphere by Diameter.                                   */
/* As the output shows, the probability modelled is the lower ordered */
/* value of the response, i.e. North_Hemisphere = 0.                  */
PROC LOGISTIC
  DATA  = SAMPLE
  PLOTS = ROC;
  MODEL North_Hemisphere = Big_Diameter;
  TITLE "Mars' Craters - Logistic - Hemisphere by Diameter - Sample Data";
RUN;

/* Model 2: Hemisphere by Diameter and Depth */
PROC LOGISTIC
  DATA  = SAMPLE
  PLOTS = ROC;
  MODEL North_Hemisphere = Big_Diameter Big_Depth;
  TITLE "Mars' Craters - Logistic - Hemisphere by Diameter and Depth - Sample Data";
RUN;

/* Clear the title */
TITLE ;