DATA622 - Assignment 4: Predicting Exoplanet Habitability With Machine Learning

Author

Anthony Josue Roman

Introduction

This project uses machine learning to identify potentially habitable exoplanets, which is an academic interest of mine. Modern missions such as TESS have discovered thousands of exoplanets, and facilities such as JWST can characterize them in detail; however, telescope time is extremely limited, so only a small number of planets will ever receive detailed follow-up observations. The objective of this study is to determine how best to allocate those limited observational resources to the most promising planets.

From a data science perspective, the question of how to prioritize planets for follow-up observations can be framed as a classification problem: estimate the likelihood that a planet belongs to a simple habitability class based on its planetary and stellar characteristics. The models I built use two supervised learning methods: Logistic Regression, an easy-to-interpret baseline from traditional machine learning, and Gradient Boosting, a more sophisticated ensemble method that captures nonlinear relationships between features and therefore generally achieves higher predictive accuracy than a standard logistic regression model.

Data and Setup

The data source for this research is the Confirmed Planets table (the pscomppars Planetary Systems Composite Parameters table) from the NASA Exoplanet Archive, downloaded via the archive's TAP service and saved locally as confirmed_planets.csv. Each entry in the table corresponds to a single confirmed planet and includes parameters describing both the planet and the star it orbits. Since most of the parameters are numeric and a target label can be established from physical parameters, this is a supervised learning problem.

The code I used to download the table as a CSV file is included below:

import pandas as pd

url = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+*+from+pscomppars&format=csv"
df = pd.read_csv(url)
df.to_csv("confirmed_planets.csv", index=False)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
    roc_auc_score,
    RocCurveDisplay
)

plt.rcParams["figure.figsize"] = (8, 5)

csv_path = "confirmed_planets.csv"

df = pd.read_csv(csv_path, comment="#")
df.shape, df.columns.tolist()[:15]
((6053, 683),
 ['objectid',
  'pl_name',
  'pl_letter',
  'hostid',
  'hostname',
  'hd_name',
  'hip_name',
  'tic_id',
  'disc_pubdate',
  'disc_year',
  'disc_method',
  'discoverymethod',
  'disc_locale',
  'disc_facility',
  'disc_instrument'])

For the analysis, I selected a subset of numeric variables that are indicative of planetary habitability: the planet's radius and mass relative to Earth, its orbital distance from the host star, the insolation flux it receives, its equilibrium temperature, and several stellar parameters (effective temperature, mass, radius, and metallicity).

cols_needed = [
    "pl_name",
    "pl_rade",
    "pl_bmasse",
    "pl_orbsmax",
    "pl_insol",
    "pl_eqt",
    "st_teff",
    "st_mass",
    "st_rad",
    "st_metfe"
]

available_cols = [c for c in cols_needed if c in df.columns]
df = df[available_cols].copy()
df.head()
pl_name pl_rade pl_bmasse pl_orbsmax pl_insol pl_eqt st_teff st_mass st_rad
0 Kepler-1167 b 1.710000 3.570 0.01750 1039.697 1419.0 4971.0 0.790 0.750
1 Kepler-1740 b 3.323214 11.000 0.07790 127.950 858.0 5705.0 0.943 0.905
2 Kepler-1581 b 0.800000 0.437 0.06865 470.862 1108.0 6022.0 1.120 1.230
3 Kepler-644 b 3.150000 10.100 0.04641 2381.770 1655.0 6747.0 1.490 1.810
4 Kepler-1752 b 4.540605 18.700 0.26980 7.290 419.0 5446.0 0.824 0.821

Label Definition and Data Cleaning

The target variable comes from a deliberately simple definition of habitability based on thresholds commonly cited in the literature. A planet is labeled habitable if it has a roughly Earth-like radius and receives a near-Earth amount of stellar energy; all other planets are labeled non-habitable. The classification therefore rests on a simple binary variable indicating how promising a planet is as a candidate for future observational study.

required_for_label = [c for c in ["pl_rade", "pl_insol"] if c in df.columns]
df = df.dropna(subset=required_for_label)

radius_min = 0.5
radius_max = 1.8
insol_min = 0.35
insol_max = 1.7

df["habitable"] = (
    (df["pl_rade"].between(radius_min, radius_max)) &
    (df["pl_insol"].between(insol_min, insol_max))
).astype(int)

df["habitable"].value_counts()
habitable
0    4192
1      38
Name: count, dtype: int64

Under this simple definition, a planet is deemed a "habitable" candidate if its radius is between 0.5 and 1.8 Earth radii and its insolation falls between 0.35 and 1.7 times Earth's. These ranges roughly correspond to terrestrial planets located within the conservative habitable zone of their stars. The resulting class distribution is highly imbalanced, reflecting the fact that most currently known exoplanets are either significantly larger than Earth or receive far higher levels of irradiation.
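To put a number on that imbalance, the positive-class fraction can be computed directly; this quick check is not part of the original output, but it follows from the counts above:

# Fraction of planets labeled habitable under the simple definition (roughly 0.9% given the counts above)
positive_fraction = df["habitable"].mean()
print(f"Habitable fraction: {positive_fraction:.3%}")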

Exploratory Data Analysis

Understanding the structure and biases of a dataset is essential before modeling it. The first aspect I reviewed was the distribution of planet radii.

df["pl_rade"].hist(bins=30)
plt.xlabel("Planet radius (Earth radii)")
plt.ylabel("Count")
plt.title("Distribution of Planet Radii")
plt.tight_layout()
plt.show()

The insolation flux distribution is highly skewed by the large number of planets that receive very high flux, particularly hot Jupiters and ultra-short-period planets orbiting close to their host stars. Only a very small fraction of planets in the dataset receive Earth-like levels of insolation.
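The text describes this distribution without showing the corresponding cell; a sketch of what that plot could look like, using log-spaced bins to handle the heavy skew, is:

# Insolation flux is heavily right-skewed, so log-spaced bins make the shape visible
insol = df.loc[df["pl_insol"] > 0, "pl_insol"]
bins = np.logspace(np.log10(insol.min()), np.log10(insol.max()), 30)
plt.hist(insol, bins=bins)
plt.xscale("log")
plt.xlabel("Insolation flux (Earth = 1)")
plt.ylabel("Count")
plt.title("Distribution of Insolation Flux")
plt.tight_layout()
plt.show()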

The equilibrium temperature distribution follows a similar pattern to that of insolation flux.

if "pl_eqt" in df.columns:
    df["pl_eqt"].hist(bins=30)
    plt.xlabel("Equilibrium Temperature (K)")
    plt.ylabel("Count")
    plt.title("Distribution of Equilibrium Temperature")
    plt.tight_layout()
    plt.show()

The majority of known planets have equilibrium temperatures between roughly 500 K and 1500 K, far hotter than Earth. Again, only a very small percentage of planets in the current sample fall within a potentially habitable thermal regime.
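That claim can be spot-checked by counting planets in a loosely temperate window; the 200-320 K bounds below are illustrative choices, not part of the label definition:

# Share of planets in an illustrative "temperate" window (200-320 K); the bounds are assumptions
if "pl_eqt" in df.columns:
    temperate = df["pl_eqt"].between(200, 320)
    print(f"Planets with 200-320 K equilibrium temperature: {temperate.mean():.1%}")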

The last step was to examine the relationship between insolation and stellar effective temperature, coloring each point by its habitability label.

if "st_teff" in df.columns:
    scatter = plt.scatter(
        df["st_teff"],
        df["pl_insol"],
        c=df["habitable"],
        cmap="coolwarm",
        alpha=0.7
    )
    plt.colorbar(scatter, label="Habitable (1) vs Non-habitable (0)")
    plt.xlabel("Stellar effective temperature (K)")
    plt.ylabel("Insolation (Earth = 1)")
    plt.title("Stellar Temperature vs Insolation")
    plt.tight_layout()
    plt.show()

From the scatter plot, the candidate habitable planets lie within a narrow band of moderate insolation values. This supports the conservative habitable-zone thresholds used to create the labels and reinforces that candidate habitable planets are very rare in the overall sample.
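One way to confirm that narrow band numerically is to summarize the insolation values of the labeled candidates; this extra check is not in the original notebook, but the values should fall inside the 0.35-1.7 window used to define the label:

# Insolation summary for the planets labeled habitable
print(df.loc[df["habitable"] == 1, "pl_insol"].describe())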

Model Preparation

To prepare the model, I take the numeric feature columns and drop any rows with missing values in those columns or in the habitability label. I then split the cleaned data into training and test sets using stratified sampling, so that the severe class imbalance is preserved in both sets.

feature_cols = [c for c in [
    "pl_rade",
    "pl_bmasse",
    "pl_orbsmax",
    "pl_insol",
    "pl_eqt",
    "st_teff",
    "st_mass",
    "st_rad",
    "st_metfe"
] if c in df.columns]

df_model = df.dropna(subset=feature_cols + ["habitable"]).copy()

X = df_model[feature_cols]
y = df_model["habitable"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)

X_train.shape, X_test.shape
((2836, 8), (1216, 8))

Baseline Model: Logistic Regression

Logistic regression is a baseline model that is relatively easy to interpret, as discussed in the first section of this course. It assumes a linear relationship between the features and the log-odds of the target class (the predicted probability is the logistic function applied to a weighted sum of the features), which makes it a good first choice for determining whether a simple, largely linear classification structure already exists in the data.

# Standardize the features; scaling matters for the linear model (tree ensembles are scale-invariant)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# class_weight="balanced" upweights the rare habitable class during training
log_reg = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    solver="lbfgs"
)
log_reg.fit(X_train_scaled, y_train)

y_pred_lr = log_reg.predict(X_test_scaled)
y_proba_lr = log_reg.predict_proba(X_test_scaled)[:, 1]

print("Logistic Regression classification report:")
print(classification_report(y_test, y_pred_lr, digits=3))
Logistic Regression classification report:
              precision    recall  f1-score   support

           0      1.000     0.956     0.978      1206
           1      0.159     1.000     0.274        10

    accuracy                          0.956      1216
   macro avg      0.579     0.978     0.626      1216
weighted avg      0.993     0.956     0.972      1216

cm_lr = confusion_matrix(y_test, y_pred_lr)
disp_lr = ConfusionMatrixDisplay(
    confusion_matrix=cm_lr,
    display_labels=["Non-habitable", "Habitable"]
)
disp_lr.plot(cmap="Blues")
plt.title("Logistic Regression - Confusion Matrix")
plt.tight_layout()
plt.show()

auc_lr = roc_auc_score(y_test, y_proba_lr)
print("Logistic Regression ROC AUC:", round(auc_lr, 3))
RocCurveDisplay.from_predictions(y_test, y_proba_lr)
plt.title("Logistic Regression - ROC Curve")
plt.tight_layout()
plt.show()

Logistic Regression ROC AUC: 0.994

Logistic regression performed quite well: it produced a very high ROC AUC on the test set, indicating a strong ability to discriminate between habitable and non-habitable planets. Every truly habitable planet in the test set was recovered (recall of 1.0), so there were no false negatives. This is particularly significant because, scientifically, the cost of failing to identify a habitable planet is higher than that of incorrectly flagging a non-habitable one. Logistic regression did flag a number of non-habitable planets as habitable (precision of only 0.159), but this is acceptable because follow-up observations can clarify those cases.
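Since interpretability is the main selling point of this baseline, it is worth inspecting the standardized coefficients the model learned. This cell was not part of the original run, so the exact values depend on the data snapshot, but they give a sense of which features drive the linear decision boundary:

# Standardized coefficients; positive values push predictions toward the habitable class
coef = pd.Series(log_reg.coef_[0], index=feature_cols)
print(coef.reindex(coef.abs().sort_values(ascending=False).index))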

Advanced Model: Gradient Boosting

Gradient Boosting is a tree-based ensemble method, a topic we discussed in the latter part of this course. It builds weak learners (shallow decision trees) sequentially, with each new tree fit to the errors of the current ensemble, which lets the combined model capture nonlinear relationships and interactions between features.

# Tree ensembles do not require feature scaling, so the unscaled features are used here
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)

y_pred_gb = gb.predict(X_test)
y_proba_gb = gb.predict_proba(X_test)[:, 1]

print("Gradient Boosting classification report:")
print(classification_report(y_test, y_pred_gb, digits=3))
Gradient Boosting classification report:
              precision    recall  f1-score   support

           0      1.000     1.000     1.000      1206
           1      1.000     1.000     1.000        10

    accuracy                          1.000      1216
   macro avg      1.000     1.000     1.000      1216
weighted avg      1.000     1.000     1.000      1216

cm_gb = confusion_matrix(y_test, y_pred_gb)
disp_gb = ConfusionMatrixDisplay(
    confusion_matrix=cm_gb,
    display_labels=["Non-habitable", "Habitable"]
)
disp_gb.plot(cmap="Greens")
plt.title("Gradient Boosting - Confusion Matrix")
plt.tight_layout()
plt.show()

auc_gb = roc_auc_score(y_test, y_proba_gb)
print("Gradient Boosting ROC AUC:", round(auc_gb, 3))
RocCurveDisplay.from_predictions(y_test, y_proba_gb)
plt.title("Gradient Boosting - ROC Curve")
plt.tight_layout()
plt.show()

Gradient Boosting ROC AUC: 1.0

Gradient Boosting yielded a perfect classification of the test set, with an ROC AUC of 1.00. This is largely to be expected: the habitability label is a deterministic threshold rule on radius and insolation, and both of those features are available to the model, so the tree ensemble only needs to learn those thresholds and the nonlinear interaction between radius, insolation, and temperature.
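A quick way to verify that interpretation (not part of the original analysis) is to fit a single shallow decision tree. Because the label is a pair of interval rules on pl_rade and pl_insol, a depth-4 tree has enough splits to express it and should also score near-perfectly:

from sklearn.tree import DecisionTreeClassifier

# The label is a box rule on radius and insolation, so a small tree should recover it
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print("Shallow tree test accuracy:", round(tree.score(X_test, y_test), 3))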

Feature Importance

To better understand what the Gradient Boosting model learned, I examine its feature importances.

importances = gb.feature_importances_
fi = pd.Series(importances, index=feature_cols).sort_values(ascending=False)
fi
pl_rade       0.561615
pl_insol      0.258389
pl_eqt        0.179996
pl_bmasse     0.000000
pl_orbsmax    0.000000
st_teff       0.000000
st_mass       0.000000
st_rad        0.000000
dtype: float64

fi.plot(kind="bar")
plt.ylabel("Importance")
plt.title("Feature Importances - Gradient Boosting")
plt.tight_layout()
plt.show()

The most important features in the model are planet radius, insolation flux, and equilibrium temperature. This makes sense because the habitability label is defined directly from radius and insolation, and equilibrium temperature is closely tied to insolation. The stellar parameters (effective temperature, mass, and radius) receive zero importance under this label definition, and stellar metallicity (st_metfe) did not appear in the cleaned feature set at all. These properties could become relevant under a future definition of habitability that incorporates stellar characteristics explicitly.
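Impurity-based importances from boosted trees can overstate some features when others are correlated, so a permutation-importance check on the test set is a reasonable sanity check; this extra step is not in the original analysis:

from sklearn.inspection import permutation_importance

# Permutation importance: mean drop in test-set score when each feature is shuffled
perm = permutation_importance(gb, X_test, y_test, n_repeats=10, random_state=42)
perm_fi = pd.Series(perm.importances_mean, index=feature_cols).sort_values(ascending=False)
print(perm_fi)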

Conclusions and Business Impact

This project demonstrates that both simple and more complex supervised learning methods can be used to rank exoplanet targets and prioritize them for further observational study. Logistic regression, a basic linear baseline, already recovered every habitable candidate in the test set, which shows that candidate habitable planets can largely be separated from the general population with simple techniques. Gradient Boosting improved on that baseline by capturing the nonlinear interactions between planet size, insolation, and temperature, achieving perfect results on the test set, although that perfect score also reflects how directly the label was derived from the input features.

From the standpoint of business or mission planning, these models can support informed decisions about how to allocate scarce telescope time. By producing a ranked list of the planets with the highest predicted habitability scores, observatories can direct their spectroscopic and atmospheric measurements at the planets most likely to be Earth-like candidates, increasing the chances of finding such candidates while reducing the chances of overlooking strong ones. Even with the simplified habitability definition used in this project, the workflow can naturally be extended to more complex, criterion-based assessments, and machine learning can continue to serve as a tool for advancing the search for life beyond the solar system.
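As a sketch of how that ranked target list could be produced from the fitted model (using the variables defined earlier; the habitability_score column name is my own, and scores for planets that were in the training set will be optimistic):

# Rank every planet in the modeling set by its predicted habitability probability
scores = gb.predict_proba(df_model[feature_cols])[:, 1]
ranked = df_model.assign(habitability_score=scores).sort_values("habitability_score", ascending=False)
print(ranked[["pl_name", "pl_rade", "pl_insol", "habitability_score"]].head(10))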