The goal of this project is to build a classification model that predicts whether a celestial object detected by the Kepler Space Telescope is a confirmed exoplanet. We use features from the Kepler Objects of Interest (KOI) dataset to identify patterns that separate confirmed planets from false-positive candidates. This work supports the growing field of astrostatistics and aims to assist in automatically flagging the most promising exoplanet candidates.
Using NASA Kepler Mission Data and Machine Learning to Classify
Exoplanet Candidates
Sudarshan Raj
UCSB Summer 2025
The goal of this project is to build a machine learning model that classifies whether a celestial object observed by NASA’s Kepler Space Telescope is a confirmed exoplanet or not. I use a public dataset from NASA’s Exoplanet Archive that contains more than 9,500 Kepler Objects of Interest (KOIs) labeled as either confirmed planets, false positives, or unconfirmed candidates. The features available include orbital characteristics of the candidate planets (such as orbital period and radius), as well as properties of the host stars (such as effective temperature, surface gravity, and radius). My classification task simplifies these into a binary outcome—planet or not—using these astrophysical predictors.
To tackle this problem, I will apply a range of machine learning techniques, including logistic regression, decision trees, random forests, and boosted trees. I’ll evaluate their performance through ROC AUC and other classification metrics. The final goal is to identify which model most accurately predicts whether a KOI is a confirmed exoplanet. In doing so, this project explores not only a predictive modeling task, but also a real-world scientific question: how can we use data science to assist in the discovery of other worlds?
When I first learned about exoplanets as a freshman in high school, I was captivated by the idea that distant planets orbiting stars light-years away might harbor the right conditions for life, a life very similar to ours. I was not in a science class at the time; I was in Spanish 1A (for the second time), listening to Neil deGrasse Tyson's countless podcast episodes on Spotify with one hidden AirPod in my ear. Years later, with my mind more mature and somewhat more focused on my studies, I began to understand the sheer complexity and scale of this field: thousands of objects are detected, but only a fraction are ultimately confirmed as planets. The rest are false positives or ambiguous signals. The process of confirming planets is intensive, often requiring multiple observations, statistical modeling, and human interpretation. With so much data, and so much room for human error or bottlenecks, I began to wonder: how could machine learning help us prioritize and validate these detections more efficiently?
Listening to Neil deGrasse Tyson taught me how science gives us the tools to not only understand the cosmos but to feel connected to it. If I can use machine learning to help even slightly in the search for habitable planets, then I’m part of something meaningful.
This idea turned into something much larger than a project. I started looking into the Kepler mission and its data release, reading about how scientists used algorithms and light curve analysis to identify signals from distant stars. I imagined what it would be like to contribute to that process—not as an astrophysicist, but as a data scientist. I realized that this dataset contained all the elements of a great machine learning problem: labeled outcomes, meaningful numerical predictors, scientific stakes, and a high degree of public interest.
At its heart, this project is driven by curiosity and the hope of contributing to something bigger. Even if this model never gets deployed in practice, I believe that every exploration into how data can enhance discovery is a step forward, big or small. Perhaps someday, these models could actually help prioritize the most promising candidates for telescope time. Maybe they could help reduce false alarms or help schoolchildren see new planets discovered by code written in R. Either way, I'm excited to explore the intersection of astronomy and artificial intelligence.
The dataset used in this project was obtained from NASA’s Exoplanet
Archive and specifically includes the Kepler Objects of Interest (KOI)
table, which is a curated summary of potential planets discovered by the
Kepler Space Telescope. The table includes both confirmed exoplanets and
false positives, along with detailed measurements of each candidate’s
orbit, host star, and detection confidence. There are 49 variables and
over 9,500 observations in the full dataset. Each observation
corresponds to a single candidate object, and the
koi_disposition field is the target variable, indicating
whether the object is CONFIRMED, FALSE POSITIVE, or CANDIDATE.
For the purpose of this project, I simplify the target into a binary
classification: “Yes” for CONFIRMED planets, and “No” for all others.
The features I use include: koi_score (a confidence score),
koi_period (orbital period in days), koi_prad
(planet radius in Earth radii), koi_steff (stellar
effective temperature), koi_slogg (stellar surface
gravity), and koi_srad (stellar radius in solar units).
These were selected based on domain knowledge and exploratory data
analysis.
The data has been cleaned by filtering out rows with missing values for these predictors. This ensures that our modeling is conducted on a reliable and consistent subset of the data. The original data file is available publicly at the NASA Exoplanet Archive and includes extensive metadata and documentation. The dataset is ideal for binary classification tasks and has been widely used in both academic and exploratory machine learning research.
The modeling workflow begins by reading in and cleaning the Kepler KOI dataset. I remove columns that are irrelevant for prediction or have excessive missing data, and transform the target variable into a binary factor. From there, I conduct exploratory data analysis (EDA), generating summary statistics and visualizations to better understand the structure of the data and relationships among variables.
After EDA, I split the data into training and testing sets using stratified sampling to ensure a balanced class distribution. I then define a preprocessing pipeline using recipes from the tidymodels ecosystem, normalizing the predictors to account for differing units and scales. For resampling and tuning, I implement 10-fold cross-validation to evaluate model stability and performance on the training set.
Four different models are trained and tuned: logistic regression, decision tree, random forest, and gradient boosted trees (XGBoost). Each model is evaluated using ROC AUC, accuracy, and confusion matrices. Hyperparameters are tuned using grid search where appropriate. After selecting the best-performing model, I apply it to the testing set to assess its generalization performance. Finally, I discuss the model’s results and suggest possible extensions, such as incorporating more features, testing neural networks, or applying this model to TESS or JWST datasets in future research.
Let’s begin our journey into the stars—through data!
“The universe is under no obligation to make sense to you.” (Neil deGrasse Tyson)
Key vocabulary:
- Exoplanet: A planet located outside of our solar system.
- Confirmed: Verified as an actual planet.
- False Positive: A signal interpreted as a planet but later invalidated.
The data is sourced from the NASA Exoplanet Archive’s Kepler dataset. It contains more than 9,500 KOIs and includes both confirmed and rejected planet candidates.
Citation: NASA Exoplanet Archive, Caltech/IPAC, https://exoplanetarchive.ipac.caltech.edu/
library(tidyverse) # Includes readr (read_csv), ggplot2, dplyr, etc.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.8 ✔ rsample 1.3.0
## ✔ dials 1.4.0 ✔ tune 1.3.0
## ✔ infer 1.0.8 ✔ workflows 1.2.0
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.3.1 ✔ yardstick 1.3.2
## ✔ recipes 1.3.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggcorrplot)
exo_raw <- read_csv("exoplanets.csv") %>% clean_names()
## Rows: 9564 Columns: 49
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): kepoi_name, kepler_name, koi_disposition, koi_pdisposition, koi_tc...
## dbl (42): kepid, koi_score, koi_fpflag_nt, koi_fpflag_ss, koi_fpflag_co, koi...
## lgl (2): koi_teq_err1, koi_teq_err2
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
exo_raw %>% glimpse()
## Rows: 9,564
## Columns: 49
## $ kepid <dbl> 10797460, 10797460, 10811496, 10848459, 10854555, 10…
## $ kepoi_name <chr> "K00752.01", "K00752.02", "K00753.01", "K00754.01", …
## $ kepler_name <chr> "Kepler-227 b", "Kepler-227 c", NA, NA, "Kepler-664 …
## $ koi_disposition <chr> "CONFIRMED", "CONFIRMED", "CANDIDATE", "FALSE POSITI…
## $ koi_pdisposition <chr> "CANDIDATE", "CANDIDATE", "CANDIDATE", "FALSE POSITI…
## $ koi_score <dbl> 1.000, 0.969, 0.000, 0.000, 1.000, 1.000, 1.000, 0.9…
## $ koi_fpflag_nt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ koi_fpflag_ss <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1…
## $ koi_fpflag_co <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ koi_fpflag_ec <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ koi_period <dbl> 9.488036, 54.418383, 19.899140, 1.736952, 2.525592, …
## $ koi_period_err1 <dbl> 2.78e-05, 2.48e-04, 1.49e-05, 2.63e-07, 3.76e-06, 2.…
## $ koi_period_err2 <dbl> -2.78e-05, -2.48e-04, -1.49e-05, -2.63e-07, -3.76e-0…
## $ koi_time0bk <dbl> 170.5387, 162.5138, 175.8503, 170.3076, 171.5956, 17…
## $ koi_time0bk_err1 <dbl> 2.16e-03, 3.52e-03, 5.81e-04, 1.15e-04, 1.13e-03, 1.…
## $ koi_time0bk_err2 <dbl> -2.16e-03, -3.52e-03, -5.81e-04, -1.15e-04, -1.13e-0…
## $ koi_impact <dbl> 0.1460, 0.5860, 0.9690, 1.2760, 0.7010, 0.5380, 0.76…
## $ koi_impact_err1 <dbl> 0.3180, 0.0590, 5.1260, 0.1150, 0.2350, 0.0300, 0.13…
## $ koi_impact_err2 <dbl> -0.1460, -0.4430, -0.0770, -0.0920, -0.4780, -0.4280…
## $ koi_duration <dbl> 2.95750, 4.50700, 1.78220, 2.40641, 1.65450, 4.59450…
## $ koi_duration_err1 <dbl> 0.08190, 0.11600, 0.03410, 0.00537, 0.04200, 0.06100…
## $ koi_duration_err2 <dbl> -0.08190, -0.11600, -0.03410, -0.00537, -0.04200, -0…
## $ koi_depth <dbl> 616, 875, 10800, 8080, 603, 1520, 686, 227, 234, 491…
## $ koi_depth_err1 <dbl> 19.5, 35.5, 171.0, 12.8, 16.9, 24.2, 18.7, 16.8, 5.8…
## $ koi_depth_err2 <dbl> -19.5, -35.5, -171.0, -12.8, -16.9, -24.2, -18.7, -1…
## $ koi_prad <dbl> 2.26, 2.83, 14.60, 33.46, 2.75, 3.90, 2.77, 1.59, 39…
## $ koi_prad_err1 <dbl> 0.26, 0.32, 3.92, 8.50, 0.88, 1.27, 0.90, 0.52, 6.45…
## $ koi_prad_err2 <dbl> -0.15, -0.19, -1.31, -2.83, -0.35, -0.42, -0.30, -0.…
## $ koi_teq <dbl> 793, 443, 638, 1395, 1406, 835, 1160, 1360, 1342, 60…
## $ koi_teq_err1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ koi_teq_err2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ koi_insol <dbl> 93.59, 9.11, 39.30, 891.96, 926.16, 114.81, 427.65, …
## $ koi_insol_err1 <dbl> 29.45, 2.87, 31.04, 668.95, 874.33, 112.85, 420.33, …
## $ koi_insol_err2 <dbl> -16.65, -1.62, -10.49, -230.35, -314.24, -36.70, -13…
## $ koi_model_snr <dbl> 35.8, 25.8, 76.3, 505.6, 40.9, 66.5, 40.2, 15.0, 47.…
## $ koi_tce_plnt_num <dbl> 1, 2, 1, 1, 1, 1, 2, 3, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1…
## $ koi_tce_delivname <chr> "q1_q17_dr25_tce", "q1_q17_dr25_tce", "q1_q17_dr25_t…
## $ koi_steff <dbl> 5455, 5455, 5853, 5805, 6031, 6046, 6046, 6046, 6227…
## $ koi_steff_err1 <dbl> 81, 81, 158, 157, 169, 189, 189, 189, 111, 75, 78, 7…
## $ koi_steff_err2 <dbl> -81, -81, -176, -174, -211, -232, -232, -232, -124, …
## $ koi_slogg <dbl> 4.467, 4.467, 4.544, 4.564, 4.438, 4.486, 4.486, 4.4…
## $ koi_slogg_err1 <dbl> 0.064, 0.064, 0.044, 0.053, 0.070, 0.054, 0.054, 0.0…
## $ koi_slogg_err2 <dbl> -0.096, -0.096, -0.176, -0.168, -0.210, -0.229, -0.2…
## $ koi_srad <dbl> 0.927, 0.927, 0.868, 0.791, 1.046, 0.972, 0.972, 0.9…
## $ koi_srad_err1 <dbl> 0.105, 0.105, 0.233, 0.201, 0.334, 0.315, 0.315, 0.3…
## $ koi_srad_err2 <dbl> -0.061, -0.061, -0.078, -0.067, -0.133, -0.105, -0.1…
## $ ra <dbl> 291.9342, 291.9342, 297.0048, 285.5346, 288.7549, 29…
## $ dec <dbl> 48.14165, 48.14165, 48.13413, 48.28521, 48.22620, 48…
## $ koi_kepmag <dbl> 15.347, 15.347, 15.436, 15.597, 15.509, 15.714, 15.7…
Before building any models, we first explore the dataset to understand the distribution and relationships between key variables. We are particularly interested in how planet characteristics (such as radius and orbital period) and stellar properties (such as temperature and radius) differ between confirmed exoplanets and false positives.
We begin by examining the structure of the dataset and plotting key variables to identify patterns, possible outliers, and the overall balance of the classification labels. This step is crucial to understand whether preprocessing or feature engineering is necessary before modeling.
The following analysis includes:
- Summary statistics and structure of the cleaned dataset
- Class balance of the outcome variable (planet)
- Visualizations of distributions and relationships between variables
Before performing exploratory data analysis, we clean the dataset to ensure all relevant columns are available and that our target variable is clearly defined.
We are interested in classifying whether a Kepler Object of Interest
(KOI) is a confirmed planet (CONFIRMED) or not. Therefore,
we define a binary target variable planet with two levels:
"Yes" for confirmed planets, and "No" for all
other cases.
We also remove observations with missing values for selected numerical predictors.
# Clean and prepare dataset
exo <- exo_raw %>%
filter(!is.na(koi_disposition)) %>%
mutate(
planet = factor(if_else(koi_disposition == "CONFIRMED", "Yes", "No")),
across(where(is.character), as.factor)
) %>%
select(planet, koi_score, koi_period, koi_prad, koi_steff, koi_slogg, koi_srad) %>%
drop_na()
# Preview cleaned data
exo %>% glimpse()
## Rows: 7,995
## Columns: 7
## $ planet <fct> Yes, Yes, No, No, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Ye…
## $ koi_score <dbl> 1.000, 0.969, 0.000, 0.000, 1.000, 1.000, 1.000, 0.992, 0.0…
## $ koi_period <dbl> 9.488036, 54.418383, 19.899140, 1.736952, 2.525592, 11.0943…
## $ koi_prad <dbl> 2.26, 2.83, 14.60, 33.46, 2.75, 3.90, 2.77, 1.59, 39.21, 5.…
## $ koi_steff <dbl> 5455, 5455, 5853, 5805, 6031, 6046, 6046, 6046, 6227, 5031,…
## $ koi_slogg <dbl> 4.467, 4.467, 4.544, 4.564, 4.438, 4.486, 4.486, 4.486, 3.9…
## $ koi_srad <dbl> 0.927, 0.927, 0.868, 0.791, 1.046, 0.972, 0.972, 0.972, 1.9…
# Class balance
exo %>% count(planet) %>%
mutate(pct = n / sum(n) * 100)
We use a pair plot to explore pairwise relationships between the numeric features. Confirmed planets (Yes) appear to have different distributions in radius and orbital period than non-confirmed (No) candidates.
exo %>%
ggpairs(columns = 2:7, aes(color = planet, alpha = 0.4))
To analyze how the predictors relate to one another, and how those relationships vary by planet status, we use a pairwise plot from the GGally package. This matrix of scatterplots, density plots, and correlation statistics provides a comprehensive overview of the dataset’s structure. Each cell in the matrix compares two predictors across the two classes of the target variable (planet), with additional color coding and correlation statistics.
Immediately, we notice that confirmed planets (shown in a distinct color, such as blue) tend to cluster differently than non-confirmed ones. For example, in the plots involving koi_prad (planet radius), confirmed exoplanets show tighter grouping and less variance, while non-confirmed candidates are more widely scattered, often with extreme values. The same is true for the relationships involving koi_score, a confidence metric assigned to each candidate. In fact, koi_score shows stronger negative correlation with koi_steff and koi_period in the confirmed class than in the non-confirmed class, suggesting different structural relationships across categories.
This plot also highlights distributional skewness in several variables, especially koi_period and koi_prad, reinforcing the need for scaling or transformation during preprocessing. More subtly, the density plots on the diagonal indicate class imbalance and varying spread, which models will need to accommodate. Overall, this pair plot acts as a visual summary of the modeling challenges ahead — nonlinear relationships, outliers, and varying scales — and also reveals encouraging patterns that support our project goal.
Boxplots reveal the spread and outliers of planet radius (koi_prad) across planet types. Confirmed exoplanets tend to have smaller radii than false positives or non-confirmed KOIs. Before building models, it’s important to visually examine how key predictors behave across the two outcome classes — confirmed and non-confirmed planet candidates. The variable koi_prad, which measures the planet radius in Earth radii, serves as one of the most intuitive features to analyze. The boxplot below illustrates the distribution of planet radius values across the binary response variable planet. This plot helps us identify how compact or spread the values are for each class, as well as detect any extreme outliers.
ggplot(exo, aes(x = planet, y = koi_prad, fill = planet)) +
geom_boxplot() +
labs(title = "Planet Radius vs. Confirmation Status", y = "Planet Radius (Earth radii)", x = "Confirmed?")
From the visualization, we immediately notice that confirmed exoplanets tend to cluster around a smaller range of radius values, generally under 10 Earth radii, while the “No” group — which includes false positives and ambiguous signals — has a much wider spread and several extreme outliers, some reaching sizes far exceeding realistic planetary thresholds. These outliers may result from noise, instrument error, or data entry anomalies. The compactness of the “Yes” group suggests that confirmed planets often fall within a physically plausible and statistically narrow size range. This pattern aligns with our physical understanding of planet formation and gives early indication that koi_prad could be a strong predictor in our modeling phase.
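Because a few extreme radii stretch the axis, a log-scaled variant of the same boxplot (a small sketch, not part of the main analysis) makes the bulk of each distribution easier to compare:
# Hedged sketch: the same boxplot with a log10 y-axis so extreme radii do not dominate.
ggplot(exo, aes(x = planet, y = koi_prad, fill = planet)) +
geom_boxplot() +
scale_y_log10() +
labs(title = "Planet Radius vs. Confirmation Status (log scale)", y = "Planet Radius (Earth radii)", x = "Confirmed?")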
Another critical feature in exoplanet classification is koi_period, which measures how long a planet takes to orbit its star, in days. This orbital period can provide valuable insights into whether a detected object behaves like a planet or not. However, the raw distribution of this variable is highly skewed, as many planets have short orbital periods, with fewer objects orbiting over months or years. To better visualize the distribution, we apply a log scale to the x-axis, helping to compress long tails and illuminate the density of shorter-period objects.
ggplot(exo, aes(x = koi_period)) +
geom_histogram(bins = 50, fill = "steelblue", color = "white") +
scale_x_log10() +
labs(title = "Histogram of Orbital Period (log scale)", x = "Orbital Period (days)", y = "Count")
The distribution of koi_period (orbital period) is right-skewed, with many planets having short orbits. Most of the candidate objects, including many confirmed planets, orbit their host stars in fewer than 100 days — a byproduct of both observational bias (short-period planets are easier to detect with the transit method) and astrophysical reality. Interestingly, there appears to be a dense cluster of objects with orbits under 10 days, which likely includes many hot Jupiters and compact planetary systems. On the far right, we observe a tapering tail of longer-period objects, though they are relatively rare in the dataset. This distribution gives us a clearer understanding of the observational landscape and reinforces the need for transformation or normalization in modeling workflows.
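As a minimal sketch of how such a transformation could be folded into preprocessing (this variant is not used in the main workflow below), a recipe could log-transform the skewed predictors before normalizing them:
# Hedged sketch: log10-transform the right-skewed predictors, then normalize.
recipe(planet ~ ., data = exo) %>%
step_log(koi_period, koi_prad, base = 10) %>%
step_normalize(all_numeric_predictors()) %>%
prep() %>%
bake(new_data = NULL) %>%
glimpse()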
Before modeling, it’s important to understand the relationships between our predictor variables. If certain features are highly correlated, they may introduce multicollinearity in some models or affect variable importance rankings. To visualize these relationships, we generated a correlation heatmap using the ggcorrplot package, which displays Pearson correlation coefficients between the numeric variables in our dataset.
exo %>%
select(where(is.numeric)) %>%
cor() %>%
round(2) %>%
ggcorrplot(lab = TRUE, type = "lower", tl.cex = 10)
The first striking observation is the strong negative correlation
(-0.64) between koi_slogg (stellar surface gravity) and koi_srad
(stellar radius). This makes physical sense — as stellar radius
increases, surface gravity tends to decrease, all else being equal.
Interestingly, most of the other features show only weak correlations
with each other, which suggests that we have a diverse set of predictors
that capture different aspects of planetary systems. This diversity is
beneficial for model stability. That said, minor correlations such as
the weak negative relationship between koi_score and koi_steff may still
play a role in feature interactions, so it will be important to monitor
them during model training.
The matrix also affirms that many variables are fairly independent, which supports their inclusion in our initial models. Still, the lower-right triangle allows us to verify that there are no extremely collinear pairs (e.g., > 0.8 or < -0.8), reducing concern for redundancy in tree-based methods or L1-regularized models. In sum, this visualization gives us confidence that the predictors offer unique, complementary information about our celestial candidates.
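As a quick programmatic complement to the heatmap (a small sketch), we can flag any predictor pairs whose absolute Pearson correlation exceeds 0.8; based on the matrix above, this should return none:
# Hedged sketch: flag predictor pairs with |r| > 0.8 (none are expected here).
cor_mat <- exo %>%
select(where(is.numeric)) %>%
cor()
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA # keep each pair only once
which(abs(cor_mat) > 0.8, arr.ind = TRUE)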
| Variable | Type | Description |
|---|---|---|
| planet | Factor | Target variable; "Yes" if the KOI is a confirmed exoplanet, "No" otherwise. |
| koi_score | Numeric | Disposition score for the Kepler Object of Interest (KOI); ranges from 0 to 1, indicating confidence in the planet candidate. |
| koi_period | Numeric | Orbital period of the planet in Earth days. Indicates how long the planet takes to complete one orbit around its star. |
| koi_prad | Numeric | Planetary radius in Earth radii. Gives a sense of how large the planet is compared to Earth. |
| koi_steff | Numeric | Effective temperature of the host star in Kelvin. Related to the star's heat output. |
| koi_slogg | Numeric | Logarithm of the star's surface gravity, measured in cgs units (cm/s²). Inversely related to the star's size. |
| koi_srad | Numeric | Stellar radius in solar radii. Indicates the size of the star relative to the Sun. |
Before we can fit and evaluate machine learning models to predict whether a KOI is a confirmed exoplanet, we must first prepare the dataset to support robust model training and validation. This section outlines three critical steps: data splitting, recipe creation, and the implementation of k-fold cross-validation. These processes ensure that all models are trained consistently and evaluated under comparable conditions, minimizing bias and variance in our results.
We begin by performing a stratified split of the exoplanet dataset into training and testing subsets. Using an 80/20 split stratified on the binary target variable (planet), we ensure that both sets maintain similar proportions of confirmed and non-confirmed candidates, preventing imbalanced class distributions between the splits.
Next, we construct a tidymodels recipe to standardize the preprocessing steps across all models. The six numeric predictors selected earlier (koi_score, koi_period, koi_prad, koi_steff, koi_slogg, and koi_srad) are retained; rows with missing values were already dropped during cleaning. The recipe removes any zero-variance predictors and then centers and scales all numeric predictors, ensuring compatibility with distance-based algorithms and facilitating regularization in penalized models.
Finally, we establish 10-fold cross-validation on the training set, stratified on planet. This approach divides the training data into 10 subsets (folds), iteratively training on 9 and validating on the remaining one. K-fold cross-validation provides a more reliable estimate of model performance across different partitions of the data and avoids the pitfalls of relying on a single training or testing split. It also ensures each model's hyperparameters are selected based on generalizability.
# Load libraries (if not already done)
library(tidyverse)
library(tidymodels)
# Set seed for reproducibility
set.seed(131)
# Split data into training and testing sets
split <- initial_split(exo, prop = 0.8, strata = planet)
train <- training(split)
test <- testing(split)
# Create a recipe (you may modify variables here as needed)
exo_recipe <- recipe(planet ~ ., data = train) %>%
step_zv(all_predictors()) %>% # Remove zero variance predictors
step_normalize(all_numeric_predictors()) # Normalize numeric predictors
# 10-fold cross-validation setup
folds <- vfold_cv(train, v = 10, strata = planet)
# Preview recipe
exo_recipe
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 6
##
## ── Operations
## • Zero variance filter on: all_predictors()
## • Centering and scaling for: all_numeric_predictors()
Logistic regression serves as a strong and interpretable baseline model in binary classification problems such as ours, where the goal is to determine whether a KOI is a confirmed exoplanet. It models the probability of the response variable (planet) being "Yes" using a logistic function applied to a linear combination of the predictors. In this context, it assumes a linear relationship between the log-odds of a KOI being a confirmed planet and its physical and orbital characteristics. By comparing its performance to more flexible models, we can understand whether added complexity yields significant predictive improvements.
# Logistic Regression model specification
log_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
# Workflow
log_wf <- workflow() %>%
add_model(log_spec) %>%
add_recipe(exo_recipe)
# Fit with cross-validation
log_results <- fit_resamples(
log_wf,
resamples = folds,
metrics = metric_set(accuracy, roc_auc)
)
## → A | warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## There were issues with some computations A: x1There were issues with some computations A: x2There were issues with some computations A: x5There were issues with some computations A: x8There were issues with some computations A: x10
collect_metrics(log_results)
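Because interpretability is the main appeal of this baseline, a brief sketch (a single fit on the training data, separate from the resampling above) shows the estimated coefficients on the log-odds scale; since the predictors are normalized, their magnitudes are roughly comparable:
# Hedged sketch: fit the logistic workflow once on the training set and
# inspect its coefficients (change in log-odds per standard deviation of each predictor).
log_wf %>%
fit(data = train) %>%
extract_fit_parsnip() %>%
tidy() %>%
arrange(desc(abs(estimate)))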
In the case of predicting confirmed planets, decision trees can uncover simple, interpretable rules, such as planets with radii between certain values or orbiting stars within a particular temperature range. We tuned hyperparameters like cost_complexity and tree_depth using cross-validation to avoid overfitting while allowing enough flexibility to discover meaningful structure. Although single decision trees can suffer from high variance, they often highlight key thresholds and relationships in the data that may go unnoticed in more opaque models.
# Decision Tree model with tuning
tree_spec <- decision_tree(
cost_complexity = tune(),
tree_depth = tune()
) %>%
set_engine("rpart") %>%
set_mode("classification")
# Workflow
tree_wf <- workflow() %>%
add_model(tree_spec) %>%
add_recipe(exo_recipe)
# Grid
tree_grid <- grid_regular(
cost_complexity(),
tree_depth(),
levels = 4
)
# Tune with CV
tree_results <- tune_grid(
tree_wf,
resamples = folds,
grid = tree_grid,
metrics = metric_set(accuracy, roc_auc)
)
collect_metrics(tree_results)
Random forest improves upon individual decision trees by building an ensemble of trees using random subsets of predictors and observations. In our context, each tree in the forest captures a slightly different view of the astronomical feature space, and the final classification is determined by aggregating these diverse perspectives. This robustness helps reduce variance and improve predictive accuracy. We tuned the number of predictors sampled at each split (mtry) and the minimum number of data points per leaf node (min_n) to optimize performance. Random forests are especially well suited to this problem because they handle nonlinear interactions and do not require extensive preprocessing. They also provide variable importance metrics, allowing us to identify which stellar and planetary features are most predictive of confirmed planet status (a sketch for extracting these importance scores appears near the end of the analysis, after the final models are compared).
# Random Forest with tuning
rf_spec <- rand_forest(
mtry = tune(),
min_n = tune(),
trees = 500
) %>%
set_engine("ranger") %>%
set_mode("classification")
# Workflow
rf_wf <- workflow() %>%
add_model(rf_spec) %>%
add_recipe(exo_recipe)
# Grid
rf_grid <- grid_regular(
mtry(range = c(1, 5)),
min_n(range = c(2, 10)),
levels = 4
)
# Tune with CV
rf_results <- tune_grid(
rf_wf,
resamples = folds,
grid = rf_grid,
metrics = metric_set(accuracy, roc_auc)
)
collect_metrics(rf_results)
Gradient boosting builds trees sequentially, with each new tree focusing on the observations the current ensemble gets wrong. High performance is frequently achieved by this iterative refinement, particularly when the outcome is driven by subtle feature interactions, as is probably the case with exoplanet classification. We implement this model using the xgboost engine and tune key hyperparameters such as the learning rate, tree depth, and loss reduction. On structured tabular data like ours, boosted trees frequently outperform conventional models, though they require more computing power. Their ability to concentrate on misclassified observations also makes them well suited to handling unequal class distributions. Therefore, in the hunt for new worlds, boosted trees are among our best candidates for final model selection and deployment.
# Boosted Tree model
boost_spec <- boost_tree(
trees = 1000,
tree_depth = tune(),
learn_rate = tune(),
loss_reduction = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# Workflow
boost_wf <- workflow() %>%
add_model(boost_spec) %>%
add_recipe(exo_recipe)
# Grid
boost_grid <- grid_regular(
tree_depth(),
learn_rate(range = c(0.01, 0.3)),
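# Note: dials::learn_rate() uses a log10 scale by default, so this range maps
# to raw learning rates of roughly 10^0.01 to 10^0.3 (about 1.02 to 2.0),
# consistent with the learn_rate values reported below; to search 0.01-0.3
# directly, the range would be c(-2, log10(0.3)).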
loss_reduction(),
levels = 3
)
library(tidymodels)
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
# Tune
boost_results <- tune_grid(
boost_wf,
resamples = folds,
grid = boost_grid,
metrics = metric_set(accuracy, roc_auc)
)
collect_metrics(boost_results)
This table shows the performance of various configurations of the boosted tree model for classifying exoplanet candidates, evaluated using cross-validation. The configuration with tree_depth = 1 and learn_rate ≈ 1.02 achieved a ROC AUC of 0.9204, making it one of the top-performing configurations, with an accuracy of 84.87%.
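For a direct view of the top boosted tree configurations (a small sketch mirroring the random forest summary below), show_best() can be applied here as well:
# Top boosted tree configurations by cross-validated ROC AUC
show_best(boost_results, metric = "roc_auc")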
library(tidymodels)
library(dplyr)
show_best(rf_results, metric = "roc_auc")
This table compiles the top five random forest configurations by their average ROC AUC scores from 10-fold cross-validation. The model with mtry = 1 and min_n = 4 performed best, with a mean ROC AUC of almost 0.9282 in identifying confirmed exoplanets. The low standard errors across all configurations, ranging from 0.0019 to 0.0021, suggest consistent performance across folds.
# Load necessary packages
library(tidymodels)
# Step 1: Select best models based on ROC AUC
best_rf <- select_best(rf_results, metric = "roc_auc")
best_boost <- select_best(boost_results, metric = "roc_auc")
best_tree <- select_best(tree_results, metric = "roc_auc")
# Step 2: Finalize workflows
final_rf_wf <- finalize_workflow(rf_wf, best_rf)
final_boost_wf <- finalize_workflow(boost_wf, best_boost)
final_tree_wf <- finalize_workflow(tree_wf, best_tree)
# Step 3: Fit on training data
final_rf_fit <- fit(final_rf_wf, data = train)
final_boost_fit <- fit(final_boost_wf, data = train)
final_tree_fit <- fit(final_tree_wf, data = train)
final_log_fit <- fit(log_wf, data = train) # already defined logistic workflow
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Step 4: Function to get ROC AUC
get_auc <- function(model, model_name, test_data) {
augment(model, new_data = test_data) %>%
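# Note: yardstick treats the first factor level ("No") as the event by default;
# if "Yes" is intended as the positive class, adding event_level = "second" to
# roc_auc() would pair .pred_Yes with it. Left as in the original analysis.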
roc_auc(truth = planet, .pred_Yes) %>%
mutate(Model = model_name)
}
# Step 5: Collect and display results
auc_results <- bind_rows(
get_auc(final_rf_fit, "Random Forest", test),
get_auc(final_boost_fit, "Boosted Tree", test),
get_auc(final_tree_fit, "Decision Tree", test),
get_auc(final_log_fit, "Logistic Regression", test)
)
# Optional: Sort by AUC
auc_results <- auc_results %>%
arrange(desc(.estimate))
auc_results
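The methods overview also mentions confusion matrices; as a short sketch, one can be produced for any of the finalized fits on the test set, for example the decision tree:
# Hedged sketch: test-set confusion matrix for the final decision tree model.
augment(final_tree_fit, new_data = test) %>%
conf_mat(truth = planet, estimate = .pred_class)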
library(ggplot2)
ggplot(auc_results, aes(x = reorder(Model, .estimate), y = .estimate)) +
geom_col(fill = "skyblue", color = "black", width = 0.6) +
coord_flip() +
labs(title = "Model Comparison (ROC AUC)", x = "Model", y = "ROC AUC") +
theme_minimal()
This bar chart provides a more intuitive and digestible visual comparison of model performance. The clear differences in bar lengths help non-technical audiences (such as astronomers or funding reviewers) quickly grasp which algorithms are better suited for the classification task.
library(ggplot2)
# First, rename the column for clarity
auc_results_clean <- auc_results %>%
rename(ROC_AUC = .estimate)
# Plot: Lollipop chart
ggplot(auc_results_clean, aes(x = reorder(Model, ROC_AUC), y = ROC_AUC)) +
geom_segment(aes(xend = Model, y = 0, yend = ROC_AUC), color = "gray") +
geom_point(size = 5, color = "blue") +
coord_flip() +
labs(title = "Model ROC AUC Comparison",
x = "Model",
y = "ROC AUC Score") +
theme_minimal()
This plot presents the AUC results with clear visual emphasis. It shows that the Decision Tree and Logistic Regression models are both strong contenders. As a result, it suggests that simpler models (like decision trees) can do just as well as more complex ones, which is useful when interpretability or computational efficiency is a concern. This is explored below, where the ROC curve of the best model (the Decision Tree) is compared with that of the Logistic Regression model.
We will now see how well the two best models (the Decision Tree and Logistic Regression) separate the classes by plotting their ROC curves on the test set.
# ROC Curve for Decision Tree (Best Model)
tree_roc_curve <- augment(final_tree_fit, new_data = test) %>%
roc_curve(planet, .pred_Yes)
autoplot(tree_roc_curve) +
labs(title = "ROC Curve - Best Decision Tree Model")
# ROC Curve for Logistic Regression (Second Best Model)
log_roc_curve <- augment(final_log_fit, new_data = test) %>%
roc_curve(planet, .pred_Yes)
autoplot(log_roc_curve) +
labs(title = "ROC Curve - Logistic Regression Model")
The two ROC curves above visualize the true positive rate (sensitivity) against the false positive rate (1 - specificity) for the Decision Tree and Logistic Regression models, respectively. These curves are critical tools for assessing the predictive performance of binary classifiers in tasks like identifying whether a KOI is a confirmed planet (the positive class) or not. The ROC curve for the Decision Tree model shows a steep initial ascent, indicating strong sensitivity at relatively low false positive rates, which signifies that it is a robust classifier at early thresholds. The curve's pull toward the top-left corner implies that the model performs well in distinguishing between confirmed and non-confirmed candidates, with high true positive rates across most threshold values.
The Logistic Regression model's ROC curve, by contrast, still arcs above the diagonal line of no discrimination but is slightly flatter than the Decision Tree's. This suggests that the Logistic Regression model has moderate predictive power but is less effective at separating the classes than the Decision Tree. Both models outperform random guessing, which is indicated by the diagonal line, but the shape and curvature of the Decision Tree's curve suggest that it may be better suited for this particular classification task. This insight aligns with the ROC AUC scores reported in the project, reinforcing the conclusion that the Decision Tree is the higher-performing model.
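Finally, the random forest discussion earlier mentioned variable importance. The sketch below (assuming the vip package is installed, and using a hypothetical rf_imp_spec specification) refits the best random forest configuration with importance = "impurity", since the ranger engine does not record importance scores unless asked, and then plots which predictors the ensemble relies on most:
# Hedged sketch: refit the best random forest with impurity importance enabled,
# then plot variable importance for the six predictors.
library(vip)
rf_imp_spec <- rand_forest(
mtry = best_rf$mtry,
min_n = best_rf$min_n,
trees = 500
) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
workflow() %>%
add_model(rf_imp_spec) %>%
add_recipe(exo_recipe) %>%
fit(data = train) %>%
extract_fit_parsnip() %>%
vip(num_features = 6)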
In this project, we developed and evaluated machine learning models to predict whether a Kepler Object of Interest is a confirmed exoplanet based on orbital, stellar, and planetary features. After comparing various models using ROC AUC scores and accuracy on the testing set, the two final models selected for evaluation were the Decision Tree and Logistic Regression classifiers. These models were chosen due to their robust performance and consistent predictive accuracy during cross-validation. Among them, the Decision Tree model outperformed the Random Forest in terms of test-set ROC AUC, suggesting that it was better at distinguishing between confirmed and non-confirmed candidates.
The Decision Tree model's simple, threshold-based splits were enough to achieve a high level of accuracy on our dataset. Although the difference in performance was not enormous, even a small gain in AUC can be meaningful in applications like exoplanet classification, where false positives or negatives could mislead scientific research.
For future work, more complex algorithms such as neural networks could be explored, particularly if paired with a larger dataset or additional engineered features. Furthermore, incorporating astrophysical data from newer space missions, or fusing observational data with simulated planetary models, could improve prediction accuracy by giving the models new data to learn from. A particularly exciting direction could be combining this classification framework with astronomical image recognition models to build an end-to-end exoplanet discovery and classification system. There are new planets being discovered every day, and top research universities are working around the clock to find more. As Neil deGrasse Tyson once said, "There is no greater education than one's own drive."