Summary

Specialisation: Data Analysis and Interpretation
Course: Data Management and Visualisation
Education Institution: Wesleyan University
Publisher: Coursera
Assignment: Running Your First Program

Exploratory Data Analysis

Source

The Mars Craters data set was made available by Wesleyan University/Coursera as part of the Data Management and Visualisation course of the Data Analysis and Interpretation Specialisation. It comes from the Ph.D. thesis “Planetary Surface Properties, Cratering Physics, and the Volcanic History of Mars from a New Global Martian Crater Database” (2011) by Robbins, S.J., University of Colorado at Boulder.

Size

The data set has a total of 384343 observations and 10 variables.

The variables are: CRATER_ID, CRATER_NAME, LATITUDE_CIRCLE_IMAGE, LONGITUDE_CIRCLE_IMAGE, DIAM_CIRCLE_IMAGE, DEPTH_RIMFLOOR_TOPOG, MORPHOLOGY_EJECTA_1, MORPHOLOGY_EJECTA_2, MORPHOLOGY_EJECTA_3 and NUMBER_LAYERS.

Univariate Analysis

Hemisphere

Hemisphere is a variable derived from LATITUDE_CIRCLE_IMAGE. It turns the continuous latitude into three categories (North, South and Equator), for the sake of brevity.

Seven observations lie exactly on the Equator, that is, with latitude equal to zero. Just over 60% of the observations (60.74%) are located in the Southern Hemisphere. No observation has a missing value.
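The derivation can also be written without a row-by-row function. The sketch below is one possible vectorised version, not the code used for the results in this document. The latitude values are made up for illustration, and, unlike an if/else chain, this version leaves a missing latitude as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical latitudes in degrees (not rows from the data set); NaN marks a missing value
latitude = pd.Series([12.5, -47.3, 0.0, -0.1, np.nan])

# np.sign gives 1.0, -1.0 or 0.0; the dictionary turns the sign into a category.
# NaN is not a key of the dictionary, so it stays NaN.
hemisphere = np.sign(latitude).map({1.0: "North", -1.0: "South", 0.0: "Equator"})

print(hemisphere)
```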

Ejecta Morphology 1 (Group by Main Feature)

The variable MORPHOLOGY_EJECTA_1 is missing for 339718 of the 384343 observations, or 88.4%. The recorded values fall into a large number of categories if the full morphology qualification is considered. Taking only the first classification (the text before the first “/”) reduces the number of categories to 31.
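As an illustration of keeping just the first classification, the sketch below takes the text before the first "/" and leaves missing values untouched. The sample strings are hypothetical stand-ins, not rows copied from the data set. The missing-value share uses the counts quoted above.

```python
import pandas as pd

# Hypothetical raw values; None stands for a crater with no recorded ejecta morphology
morphology = pd.Series(["Rd", "SLEPS/HuBL", None, "Rd/MLERS", "DLERS"])

# Split on "/" and keep the first piece; missing values remain missing
main_feature = morphology.str.split("/").str[0]
print(main_feature.value_counts(dropna=False))

# Share of missing values, from the counts reported in this document
missing, total = 339718, 384343
missing_pct = 100 * missing / total
print(f"{missing_pct:.2f}% missing")
```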

Among the 44625 recorded values, considering just the first classification, 27068 (60.66%) are of the Rd category. The only two other categories above 10% are SLERS (5110, or 11.45%) and SLEPS (4998, or 11.20%).
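These percentages are relative to the recorded (non-missing) values, not to all observations. A quick check of that arithmetic, using only counts reported in this document:

```python
total, missing = 384343, 339718
recorded = total - missing  # observations with a recorded morphology

# Counts of the three largest categories, as reported in the frequency table
counts = {"Rd": 27068, "SLERS": 5110, "SLEPS": 4998}

# Percentage of each category among the recorded values
shares = {name: round(100 * n / recorded, 2) for name, n in counts.items()}

print(recorded)
print(shares)
```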

Maximum Number of Cohesive Layers

The NUMBER_LAYERS variable has six categories (0, 1, 2, 3, 4 and 5) and none of its observations are missing. The vast majority of craters are identified as having “0” layers: 364612 records, or 94.87%.

Using SAS

Code

/* Use Course's Library */
LIBNAME mydata "/courses/d1406ae5ba27fe300" ACCESS = readonly;

/* Configure the Data */
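DATA NEW;
  /* Data set */
  SET   mydata.marscrater_pds;
  /* Declare the length up front; otherwise the first assignment ("North") fixes it at 5 characters and "Equator" is truncated */
  LENGTH Hemisphere $ 7;
  LABEL Hemisphere    = "Hemisphere"
        MorphoE1      = "Ejecta Morphology 1 (Grouped by Main Feature)"
        NUMBER_LAYERS = "Maximum Number of Cohesive Layers";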

  /* Categorise the Latitude in Hemispheres */
  IF (LATITUDE_CIRCLE_IMAGE > 0)
    THEN Hemisphere = "North";
    ELSE Hemisphere = "South";
  IF (LATITUDE_CIRCLE_IMAGE = 0)
    THEN Hemisphere = "Equator";

  /* Collapse the Morphology of Ejecta 1 to its Main Feature, to reduce the output */
  IF (INDEX(MORPHOLOGY_EJECTA_1, "/") = 0)
    THEN MorphoE1 = MORPHOLOGY_EJECTA_1;
    ELSE MorphoE1 = SUBSTR(MORPHOLOGY_EJECTA_1, 1, INDEX(MORPHOLOGY_EJECTA_1, "/") - 1);

PROC SORT;
  BY CRATER_ID;
/* Calculate Frequencies and Proportions */
PROC FREQ;
  TABLE Hemisphere MorphoE1 NUMBER_LAYERS;
RUN;

Output

Hemisphere

Hemisphere  Frequency  Percent  Cumulative Frequency  Cumulative Percent
Equator             7     0.00                     7                0.00
North          150887    39.26                150894               39.26
South          233449    60.74                384343              100.00

Ejecta Morphology 1 (Group by Main Feature)

MorphoE1  Frequency  Percent  Cumulative Frequency  Cumulative Percent
DLEPC           495     1.11                   495                1.11
DLEPCPd          10     0.02                   505                1.13
DLEPS           631     1.41                  1136                2.55
DLEPSPd           2     0.00                  1138                2.55
DLEPd             1     0.00                  1139                2.55
DLERC           386     0.86                  1525                3.42
DLERCPd           7     0.02                  1532                3.43
DLERS          1242     2.78                  2774                6.22
DLERSRd           2     0.00                  2776                6.22
DLSPC             1     0.00                  2777                6.22
MLEPC            22     0.05                  2799                6.27
MLEPS            43     0.10                  2842                6.37
MLERC            24     0.05                  2866                6.42
MLERS           491     1.10                  3357                7.52
MLERSRd           1     0.00                  3358                7.52
Pd                2     0.00                  3360                7.53
RD                1     0.00                  3361                7.53
Rd            27068    60.66                 30429               68.19
SLEPC          2601     5.83                 33030               74.02
SLEPCPd          75     0.17                 33105               74.18
SLEPCRd           2     0.00                 33107               74.19
SLEPS          4998    11.20                 38105               85.39
SLEPSPd          52     0.12                 38157               85.51
SLEPSRd           3     0.01                 38160               85.51
SLEPd            44     0.10                 38204               85.61
SLERC          1280     2.87                 39484               88.48
SLERCPd          10     0.02                 39494               88.50
SLERS          5110    11.45                 44604               99.95
SLERSPd          16     0.04                 44620               99.99
SLERSRd           4     0.01                 44624              100.00
SLErS             1     0.00                 44625              100.00

Frequency Missing = 339718

Maximum Number of Cohesive Layers

NUMBER_LAYERS  Frequency  Percent  Cumulative Frequency  Cumulative Percent
0                 364612    94.87                364612               94.87
1                  15467     4.02                380079               98.89
2                   3435     0.89                383514               99.78
3                    739     0.19                384253               99.98
4                     85     0.02                384338              100.00
5                      5     0.00                384343              100.00

Using Python

Code

"""
Created on Tue Sep 29 18:12:40 2015

@author: angeloklin
"""
# Import libraries (numpy was imported but never used)
import pandas as pd

# load data
data = pd.read_csv("marscrater_pds.csv", na_values = [" "], low_memory = False)

# function to return hemisphere
def Hemisphere(Latitude):
  if Latitude > 0:
    return "North"
  elif Latitude < 0:
    return "South"
  else:
    return "Equator"

# function to get the morphology's main feature
def MainMorpho(Morpho):
  if pd.isnull(Morpho):
    return Morpho
  foundAt = Morpho.find("/")
  if foundAt >= 0:
    return Morpho[0:foundAt]
  else:
    return Morpho

print("Mars Craters' data set summary:")
print("- Number of observations(rows): ", len(data))
print("- Number of variables(columns): ", len(data.columns))
print("")

print("Hemispheres:")
Hemispheres = data["LATITUDE_CIRCLE_IMAGE"].map(Hemisphere)
freq = Hemispheres.value_counts(sort = True)
prop = Hemispheres.value_counts(sort = True, normalize = True)
print("- Missing Values: ", Hemispheres.isnull().sum())
print("- Frequency Table: ")
print("|            | Frequency | Proportion |")
# Look values up by label: positional freq[i] on a label index is deprecated in recent pandas
for label in freq.index:
  print("|", format(label, "<10s"), "|   ", format(freq[label], ">6d"), "|    ", format(prop[label], ">.4f"), "|")
print("")

print("Ejecta Morphology 1 (Group by Main Feature):")
MorphoE1 = data["MORPHOLOGY_EJECTA_1"].map(MainMorpho)
# Keep only the recorded values, so the proportions are relative to non-missing rows
MorphoE1a = MorphoE1[MorphoE1.notnull()]
freq = MorphoE1a.value_counts(sort = True)
prop = MorphoE1a.value_counts(sort = True, normalize = True)
print("- Missing Values: ", MorphoE1.isnull().sum())
print("- Frequency Table: ")
print("|            | Frequency | Proportion |")
# Look values up by label: positional freq[i] on a label index is deprecated in recent pandas
for label in freq.index:
  print("|", format(label, "<10s"), "|   ", format(freq[label], ">6d"), "|    ", format(prop[label], ">.4f"), "|")
print("")

print("Maximum Number of Cohesive Layers:")
# Sort by the number of layers so the rows appear in the order 0 to 5
freq = data["NUMBER_LAYERS"].value_counts().sort_index()
prop = data["NUMBER_LAYERS"].value_counts(normalize = True).sort_index()
print("- Missing Values: ", data["NUMBER_LAYERS"].isnull().sum())
print("- Frequency Table: ")
print("|        | Frequency | Proportion |")
# The index holds integers, so freq[i] is a label lookup, not a position; iterate over the labels
for label in freq.index:
  print("|", format(label, "<6d"), "|   ", format(freq[label], ">6d"), "|    ", format(prop[label], ">.4f"), "|")
print("")

Output

Mars Craters’ data set summary:

  • Number of observations(rows): 384343
  • Number of variables(columns): 10

Hemisphere

Missing Values: 0

Frequency Table:

Hemisphere  Frequency  Proportion
South          233449      0.6074
North          150887      0.3926
Equator             7      0.0000

Ejecta Morphology 1 (Group by Main Feature)

Missing Values: 339718

Frequency Table:

MorphoE1  Frequency  Proportion
Rd            27068      0.6066
SLERS          5110      0.1145
SLEPS          4998      0.1120
SLEPC          2601      0.0583
SLERC          1280      0.0287
DLERS          1242      0.0278
DLEPS           631      0.0141
DLEPC           495      0.0111
MLERS           491      0.0110
DLERC           386      0.0086
SLEPCPd          75      0.0017
SLEPSPd          52      0.0012
SLEPd            44      0.0010
MLEPS            43      0.0010
MLERC            24      0.0005
MLEPC            22      0.0005
SLERSPd          16      0.0004
SLERCPd          10      0.0002
DLEPCPd          10      0.0002
DLERCPd           7      0.0002
SLERSRd           4      0.0001
SLEPSRd           3      0.0001
SLEPCRd           2      0.0000
DLEPSPd           2      0.0000
DLERSRd           2      0.0000
Pd                2      0.0000
RD                1      0.0000
MLERSRd           1      0.0000
DLEPd             1      0.0000
SLErS             1      0.0000
DLSPC             1      0.0000

Maximum Number of Cohesive Layers

Missing Values: 0

Frequency Table:

NUMBER_LAYERS  Frequency  Proportion
0                 364612      0.9487
1                  15467      0.0402
2                   3435      0.0089
3                    739      0.0019
4                     85      0.0002
5                      5      0.0000
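
The SAS tables also report cumulative frequency and cumulative percent, which the Python script does not print. A minimal sketch of how those columns could be produced with pandas follows. The name freq_table is a hypothetical helper, and the series is rebuilt from the NUMBER_LAYERS counts reported above rather than read from the CSV file.

```python
import numpy as np
import pandas as pd

def freq_table(values: pd.Series) -> pd.DataFrame:
    """Frequency table with cumulative columns, similar in layout to PROC FREQ output.

    Missing values are excluded, so percentages are relative to recorded values.
    """
    freq = values.value_counts().sort_index()
    table = pd.DataFrame({"Frequency": freq})
    table["Percent"] = 100 * freq / freq.sum()
    table["Cumulative Frequency"] = freq.cumsum()
    table["Cumulative Percent"] = table["Percent"].cumsum()
    return table

# Rebuild NUMBER_LAYERS from the counts reported in this document (assumption: no CSV at hand)
layers = pd.Series(
    np.repeat([0, 1, 2, 3, 4, 5], [364612, 15467, 3435, 739, 85, 5]),
    name="NUMBER_LAYERS",
)

table = freq_table(layers)
print(table.round(2))
```

With the real file, the same helper could be called on data["NUMBER_LAYERS"] or on the derived hemisphere and morphology series.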