Data-621 - Final Project (Predicting Exoplanet Equilibrium Temperature Using Regression Modeling)

Authors

Anthony Roman

Rupendra Shrestha

Bikash Bhowmik

Haoming Chen

Jerald Melukkar

Published

May 17, 2026

Abstract

This project investigates whether planetary equilibrium temperature can be accurately predicted from the properties of the host star, the orbit, and the planet itself. Using data from the NASA Exoplanet Archive, we apply regression modeling to relate equilibrium temperature to stellar temperature, stellar mass, stellar radius, orbital distance, planet radius, and planet mass. The study follows a full predictive modeling workflow: data cleaning, exploratory data analysis, feature selection, model construction, and model assessment. Several regression and machine learning approaches were used, including multiple linear regression, regularized regression, and random forest regression. Model performance was compared using RMSE, MAE, and \(R^2\).

Keywords

Exoplanets; Regression Modeling; Equilibrium Temperature; NASA Exoplanet Archive; Predictive Analytics

Introduction

An exoplanet is a planet that orbits a star beyond our Solar System. Since the first extrasolar planets were confirmed, many more have been discovered using techniques such as transit photometry, radial velocity observations, direct imaging, and microlensing. These discoveries have expanded our knowledge of planetary systems, yet many questions about their formation, orbits, and habitability remain unanswered.

One of the quantities used to study planets is equilibrium temperature, an estimate of the temperature a planet would have given the energy it receives from its host star. Although it does not accurately describe surface temperature, it is useful for identifying planets that fall within temperature ranges that may be considered habitable.
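For reference, under the common simplifying assumptions of a circular orbit and uniform heat redistribution, equilibrium temperature is often written as \(T_{eq} = T_{\star}\sqrt{R_{\star}/(2a)}\,(1-A)^{1/4}\), where \(A\) is the Bond albedo. The sketch below is illustrative only and is not part of the project's code: the function name `equilibrium_temp`, the default albedo of zero, and the unit conversion constant are assumptions made for this example.

```r
# Illustrative sketch (not from the project): equilibrium temperature
# for a planet on a circular orbit with uniform heat redistribution.
AU_IN_SOLAR_RADII <- 215.032  # approximate length of 1 AU in solar radii

equilibrium_temp <- function(st_teff, st_rad, pl_orbsmax, albedo = 0) {
  # st_teff:    stellar effective temperature (K)
  # st_rad:     stellar radius (solar radii)
  # pl_orbsmax: semi-major axis (AU)
  a_solar_radii <- pl_orbsmax * AU_IN_SOLAR_RADII
  st_teff * sqrt(st_rad / (2 * a_solar_radii)) * (1 - albedo)^0.25
}

# Sun-Earth values as a sanity check
teq_earth <- equilibrium_temp(st_teff = 5772, st_rad = 1, pl_orbsmax = 1)
print(round(teq_earth, 1))
```

With Sun-Earth inputs the function returns roughly 278 K, in line with the commonly quoted zero-albedo value for Earth. Archive values of pl_eqt may rest on different albedo or redistribution assumptions, so exact agreement with this expression should not be expected.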

The goal of this research is to construct a model that predicts exoplanet equilibrium temperature from measurable stellar, orbital, and planetary variables. The topic suits regression modeling because the outcome variable is numeric and the NASA Exoplanet Archive provides many candidate predictors.

Literature Review

Machine learning and regression analysis play a significant role in astrophysics research on exoplanets and planetary systems. For example, machine learning and statistical methods have been used to predict planetary parameters, classify exoplanets, and identify potentially habitable planets from astronomical survey observations (James et al., 2021).

Regression analysis is commonly used to establish relationships between the characteristics of stars and planets. It is effective for analyzing linear relationships and testing the significance of predictors. However, some astrophysical systems feature nonlinear relationships that require other approaches (Kuhn & Johnson, 2013).

More recently, researchers have applied machine learning approaches such as random forests, artificial neural networks, and support vector machines to exoplanet data. Such methods are effective at capturing complicated nonlinear relationships between variables and often perform better than conventional regression approaches on observational data sets. The random forest algorithm is especially useful because it both captures nonlinear relationships and assesses the relative importance of each predictor (Hastie, Tibshirani, & Friedman, 2009).

The current research contributes to the literature by applying both regression and machine learning algorithms to model the continuous equilibrium temperature of an exoplanet. It differs from other studies, which mainly concentrate on classification algorithms for exoplanets. In addition, the study examines whether a nonlinear machine learning approach offers any benefit over linear regression.

Research Question

Which stellar, orbital, and planetary characteristics are most important for predicting exoplanet equilibrium temperature?

Methodology

In this analysis, data on confirmed exoplanets is used, sourced from the NASA Exoplanet Archive. This analysis seeks to predict the equilibrium temperature of an exoplanet based on certain predictor variables that characterize the planet and its star. Regression modeling techniques were used to analyze the relationship between equilibrium temperature and the selected predictor variables.

The data was extracted from the NASA Exoplanet Archive Planetary Systems Composite Parameters table, which contains data on confirmed exoplanet observations and their stellar system characteristics.

Data preparation and cleaning were performed in R. Predictor variables with too many missing values were eliminated, and observations with missing values in the remaining predictors were removed from the modeling data set. Variable distributions and correlations were then examined through exploratory data analysis.
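The two cleaning steps described above can be sketched on a toy data frame. This is a minimal base R illustration rather than the project's code: the toy values, the object names, and the 50% missingness cutoff are assumptions, since the report does not state the threshold that was used.

```r
# Illustrative sketch on toy data; the 50% cutoff is an assumed threshold.
toy <- data.frame(
  pl_eqt   = c(1700, NA, 1450, 900, 600),
  st_teff  = c(4060, 5338, NA, 5750, 4980),
  pl_insol = c(NA, NA, NA, NA, 120)
)

max_missing <- 0.5  # hypothetical cutoff for dropping a column

# Share of missing values in each column
missing_share <- colMeans(is.na(toy))

# Step 1: drop columns whose missing share exceeds the cutoff
kept <- toy[, missing_share <= max_missing, drop = FALSE]

# Step 2: drop rows that still contain any missing value
toy_clean <- kept[complete.cases(kept), , drop = FALSE]

print(missing_share)
print(toy_clean)
```

Dropping sparse columns first matters because removing incomplete rows across all columns would otherwise discard most observations.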

Code
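library(tidyverse)     # readr, dplyr, tidyr, ggplot2
library(caret)         # createDataPartition() for the train/test split
library(glmnet)        # ridge and lasso regression
library(randomForest)  # random forest regression
library(broom)         # tidy() model summaries
library(corrplot)      # correlation matrix plot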
Code
# Lines starting with "#" in the downloaded CSV are treated as comments and skipped
exoplanets <- read_csv(
  "exoplanets.csv",
  comment = "#"
)

glimpse(exoplanets)
Rows: 6,286
Columns: 84
$ pl_name         <chr> "11 Com b", "11 UMi b", "14 And b", "14 Her b", "16 Cy…
$ hostname        <chr> "11 Com", "11 UMi", "14 And", "14 Her", "16 Cyg B", "1…
$ sy_snum         <dbl> 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, …
$ sy_pnum         <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, …
$ discoverymethod <chr> "Radial Velocity", "Radial Velocity", "Radial Velocity…
$ disc_year       <dbl> 2007, 2009, 2008, 2002, 1996, 2020, 2008, 2008, 2018, …
$ disc_facility   <chr> "Xinglong Station", "Thueringer Landessternwarte Taute…
$ pl_controv_flag <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pl_orbper       <dbl> 323.21, 516.22, 186.76, 1766.41, 798.50, 578.38, 982.8…
$ pl_orbpererr1   <dbl> 0.06, 3.20, 0.11, 0.67, 1.00, 2.01, 1.06, NA, 0.00, 2.…
$ pl_orbpererr2   <dbl> -0.05, -3.20, -0.12, -0.68, -1.00, -2.09, -0.92, NA, -…
$ pl_orbperlim    <dbl> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, NA, NA, NA, NA, NA, …
$ pl_orbsmax      <dbl> 1.178, 1.530, 0.775, 2.839, 1.660, 1.450, 2.476, 330.0…
$ pl_orbsmaxerr1  <dbl> 0.000, 0.070, 0.000, 0.039, 0.030, 0.020, 0.002, NA, 0…
$ pl_orbsmaxerr2  <dbl> 0.000, -0.070, 0.000, -0.041, -0.030, -0.020, -0.002, …
$ pl_orbsmaxlim   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pl_rade         <dbl> 12.20000, 12.30000, 13.10000, 12.50000, 13.50000, 12.9…
$ pl_radeerr1     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_radeerr2     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_radelim      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pl_radj         <dbl> 1.090, 1.090, 1.160, 1.120, 1.200, 1.150, 1.120, 1.664…
$ pl_radjerr1     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_radjerr2     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_radjlim      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pl_bmasse       <dbl> 4914.8985, 4684.8142, 1131.1513, 2828.6728, 565.7374, …
$ pl_bmasseerr1   <dbl> 39.09289, 794.57500, 36.23244, 413.17693, 25.42640, 47…
$ pl_bmasseerr2   <dbl> -39.72855, -794.57500, -38.77507, -540.30829, -25.4264…
$ pl_bmasselim    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pl_bmassj       <dbl> 15.464, 14.740, 3.559, 8.900, 1.780, 4.320, 9.207, 8.0…
$ pl_bmassjerr1   <dbl> 0.123, 2.500, 0.114, 1.300, 0.080, 0.150, 0.160, 1.000…
$ pl_bmassjerr2   <dbl> -0.125, -2.500, -0.122, -1.700, -0.080, -0.120, -0.077…
$ pl_bmassjlim    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pl_bmassprov    <chr> "Msini", "Msini", "Msini", "Mass", "Msini", "Msini", "…
$ pl_orbeccen     <dbl> 0.2380, 0.0800, 0.0000, 0.3683, 0.6800, 0.0600, 0.0240…
$ pl_orbeccenerr1 <dbl> 0.0070, 0.0300, NA, 0.0029, 0.0200, 0.0300, 0.0070, NA…
$ pl_orbeccenerr2 <dbl> -0.0070, -0.0300, NA, -0.0029, -0.0200, -0.0200, -0.01…
$ pl_orbeccenlim  <dbl> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, NA, NA, NA, NA, NA, …
$ pl_insol        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_insolerr1    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_insolerr2    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_insollim     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_eqt          <dbl> NA, NA, NA, NA, NA, NA, NA, 1700, NA, NA, NA, 1450, NA…
$ pl_eqterr1      <dbl> NA, NA, NA, NA, NA, NA, NA, 100, NA, NA, NA, 50, NA, 1…
$ pl_eqterr2      <dbl> NA, NA, NA, NA, NA, NA, NA, -100, NA, NA, NA, -50, NA,…
$ pl_eqtlim       <dbl> NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, 0, NA, 0, N…
$ ttv_flag        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ st_spectype     <chr> "G8 III", "K4 III", "K0 III", "K0", "G3 V", "K3 III", …
$ st_teff         <dbl> 4874, 4213, 4888, 5338, 5750, 4157, 4980, 4060, 4816, …
$ st_tefferr1     <dbl> NA, 46, NA, 25, 8, 11, NA, 300, NA, 44, 44, 100, NA, 5…
$ st_tefferr2     <dbl> NA, -46, NA, -25, -8, -10, NA, -200, NA, -44, -44, -10…
$ st_tefflim      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0,…
$ st_rad          <dbl> 13.760000, 29.790000, 11.550000, 0.930000, 1.130000, 2…
$ st_raderr1      <dbl> 2.850000, 2.840000, 1.120000, 0.010000, 0.010000, 0.78…
$ st_raderr2      <dbl> -2.450000, -2.840000, -0.510000, -0.010000, -0.010000,…
$ st_radlim       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0,…
$ st_mass         <dbl> 2.090000, 2.780000, 1.780000, 0.970000, 1.080000, 1.22…
$ st_masserr1     <dbl> 0.640000, 0.690000, 0.430000, 0.040000, 0.040000, 0.13…
$ st_masserr2     <dbl> -0.630000, -0.690000, -0.290000, -0.040000, -0.040000,…
$ st_masslim      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ st_met          <dbl> -0.260, -0.020, -0.210, 0.430, 0.060, -0.010, -0.060, …
$ st_meterr1      <dbl> 0.10, NA, 0.10, 0.07, NA, 0.10, 0.10, NA, 0.10, 0.04, …
$ st_meterr2      <dbl> -0.10, NA, -0.10, -0.07, NA, -0.10, -0.10, NA, -0.10, …
$ st_metlim       <dbl> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, NA, NA, NA, NA, N…
$ st_metratio     <chr> "[Fe/H]", "[Fe/H]", "[Fe/H]", "[Fe/H]", "[Fe/H]", "[Fe…
$ st_logg         <dbl> 2.45000, 1.93000, 2.55000, 4.45000, 4.36000, 1.70000, …
$ st_loggerr1     <dbl> 0.080000, 0.070000, 0.060000, 0.020000, 0.010000, 0.04…
$ st_loggerr2     <dbl> -0.080000, -0.070000, -0.070000, -0.020000, -0.010000,…
$ st_logglim      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0,…
$ rastr           <chr> "12h20m42.91s", "15h17m05.90s", "23h31m17.80s", "16h10…
$ ra              <dbl> 185.17878, 229.27460, 352.82415, 242.60210, 295.46564,…
$ decstr          <chr> "+17d47m35.71s", "+71d49m26.19s", "+39d14m09.01s", "+4…
$ dec             <dbl> 17.7932516, 71.8239428, 39.2358367, 43.8163621, 50.516…
$ sy_dist         <dbl> 93.18460, 125.32100, 75.43920, 17.93230, 21.13970, 124…
$ sy_disterr1     <dbl> 1.9238000, 1.9765000, 0.7140000, 0.0073000, 0.0110000,…
$ sy_disterr2     <dbl> -1.9238000, -1.9765000, -0.7140000, -0.0073000, -0.011…
$ sy_vmag         <dbl> 4.72307, 5.01300, 5.23133, 6.61935, 6.21500, 5.22606, …
$ sy_vmagerr1     <dbl> 0.023, 0.005, 0.023, 0.023, 0.016, 0.023, 0.023, 0.069…
$ sy_vmagerr2     <dbl> -0.023, -0.005, -0.023, -0.023, -0.016, -0.023, -0.023…
$ sy_kmag         <dbl> 2.282000, 1.939000, 2.331000, 4.714000, 4.651000, 2.09…
$ sy_kmagerr1     <dbl> 0.346, 0.270, 0.240, 0.016, 0.016, 0.244, 0.204, 0.021…
$ sy_kmagerr2     <dbl> -0.346, -0.270, -0.240, -0.016, -0.016, -0.244, -0.204…
$ sy_gaiamag      <dbl> 4.44038, 4.56216, 4.91781, 6.38300, 6.06428, 4.75429, …
$ sy_gaiamagerr1  <dbl> 0.0038479, 0.0039035, 0.0028262, 0.0003512, 0.0006029,…
$ sy_gaiamagerr2  <dbl> -0.0038479, -0.0039035, -0.0028262, -0.0003512, -0.000…
Code
colnames(exoplanets)
 [1] "pl_name"         "hostname"        "sy_snum"         "sy_pnum"        
 [5] "discoverymethod" "disc_year"       "disc_facility"   "pl_controv_flag"
 [9] "pl_orbper"       "pl_orbpererr1"   "pl_orbpererr2"   "pl_orbperlim"   
[13] "pl_orbsmax"      "pl_orbsmaxerr1"  "pl_orbsmaxerr2"  "pl_orbsmaxlim"  
[17] "pl_rade"         "pl_radeerr1"     "pl_radeerr2"     "pl_radelim"     
[21] "pl_radj"         "pl_radjerr1"     "pl_radjerr2"     "pl_radjlim"     
[25] "pl_bmasse"       "pl_bmasseerr1"   "pl_bmasseerr2"   "pl_bmasselim"   
[29] "pl_bmassj"       "pl_bmassjerr1"   "pl_bmassjerr2"   "pl_bmassjlim"   
[33] "pl_bmassprov"    "pl_orbeccen"     "pl_orbeccenerr1" "pl_orbeccenerr2"
[37] "pl_orbeccenlim"  "pl_insol"        "pl_insolerr1"    "pl_insolerr2"   
[41] "pl_insollim"     "pl_eqt"          "pl_eqterr1"      "pl_eqterr2"     
[45] "pl_eqtlim"       "ttv_flag"        "st_spectype"     "st_teff"        
[49] "st_tefferr1"     "st_tefferr2"     "st_tefflim"      "st_rad"         
[53] "st_raderr1"      "st_raderr2"      "st_radlim"       "st_mass"        
[57] "st_masserr1"     "st_masserr2"     "st_masslim"      "st_met"         
[61] "st_meterr1"      "st_meterr2"      "st_metlim"       "st_metratio"    
[65] "st_logg"         "st_loggerr1"     "st_loggerr2"     "st_logglim"     
[69] "rastr"           "ra"              "decstr"          "dec"            
[73] "sy_dist"         "sy_disterr1"     "sy_disterr2"     "sy_vmag"        
[77] "sy_vmagerr1"     "sy_vmagerr2"     "sy_kmag"         "sy_kmagerr1"    
[81] "sy_kmagerr2"     "sy_gaiamag"      "sy_gaiamagerr1"  "sy_gaiamagerr2" 
Code
summary(exoplanets)
   pl_name            hostname            sy_snum         sy_pnum     
 Length:6286        Length:6286        Min.   :1.000   Min.   :1.000  
 Class :character   Class :character   1st Qu.:1.000   1st Qu.:1.000  
 Mode  :character   Mode  :character   Median :1.000   Median :1.000  
                                       Mean   :1.101   Mean   :1.762  
                                       3rd Qu.:1.000   3rd Qu.:2.000  
                                       Max.   :4.000   Max.   :8.000  
                                                                      
 discoverymethod      disc_year    disc_facility      pl_controv_flag   
 Length:6286        Min.   :1992   Length:6286        Min.   :0.000000  
 Class :character   1st Qu.:2014   Class :character   1st Qu.:0.000000  
 Mode  :character   Median :2016   Mode  :character   Median :0.000000  
                    Mean   :2017                      Mean   :0.008591  
                    3rd Qu.:2021                      3rd Qu.:0.000000  
                    Max.   :2026                      Max.   :1.000000  
                    NA's   :1                                           
   pl_orbper         pl_orbpererr1       pl_orbpererr2       
 Min.   :        0   Min.   :        0   Min.   :-100000000  
 1st Qu.:        4   1st Qu.:        0   1st Qu.:         0  
 Median :       11   Median :        0   Median :         0  
 Mean   :    71986   Mean   :    87738   Mean   :    -20041  
 3rd Qu.:       38   3rd Qu.:        0   3rd Qu.:         0  
 Max.   :402000000   Max.   :470000000   Max.   :         0  
 NA's   :340         NA's   :830         NA's   :830         
  pl_orbperlim        pl_orbsmax        pl_orbsmaxerr1     
 Min.   :-1.00000   Min.   :4.400e-03   Min.   :   0.0000  
 1st Qu.: 0.00000   1st Qu.:5.230e-02   1st Qu.:   0.0007  
 Median : 0.00000   Median :1.021e-01   Median :   0.0020  
 Mean   :-0.00101   Mean   :1.566e+01   Mean   :   1.8013  
 3rd Qu.: 0.00000   3rd Qu.:3.047e-01   3rd Qu.:   0.0155  
 Max.   : 0.00000   Max.   :1.900e+04   Max.   :5205.0000  
 NA's   :340        NA's   :426         NA's   :2364       
 pl_orbsmaxerr2       pl_orbsmaxlim         pl_rade         pl_radeerr1     
 Min.   :-2.060e+03   Min.   :-1.00000   Min.   : 0.3098   Min.   : 0.0000  
 1st Qu.:-1.600e-02   1st Qu.: 0.00000   1st Qu.: 1.8400   1st Qu.: 0.1273  
 Median :-2.000e-03   Median : 0.00000   Median : 2.8466   Median : 0.2700  
 Mean   :-9.593e-01   Mean   :-0.00051   Mean   : 5.8078   Mean   : 0.5144  
 3rd Qu.:-7.300e-04   3rd Qu.: 0.00000   3rd Qu.:11.9000   3rd Qu.: 0.5500  
 Max.   : 0.000e+00   Max.   : 0.00000   Max.   :87.2059   Max.   :68.9100  
 NA's   :2364         NA's   :425        NA's   :50        NA's   :1935     
  pl_radeerr2         pl_radelim           pl_radj         pl_radjerr1     
 Min.   :-32.5061   Min.   :-1.000000   Min.   :0.02764   Min.   :0.00000  
 1st Qu.: -0.4500   1st Qu.: 0.000000   1st Qu.:0.16415   1st Qu.:0.01128  
 Median : -0.2200   Median : 0.000000   Median :0.25383   Median :0.02409  
 Mean   : -0.4135   Mean   :-0.000641   Mean   :0.51816   Mean   :0.04589  
 3rd Qu.: -0.1100   3rd Qu.: 0.000000   3rd Qu.:1.06000   3rd Qu.:0.04907  
 Max.   :  0.0000   Max.   : 0.000000   Max.   :7.78000   Max.   :6.14774  
 NA's   :1935       NA's   :50          NA's   :50        NA's   :1935     
  pl_radjerr2         pl_radjlim          pl_bmasse        pl_bmasseerr1     
 Min.   :-2.90000   Min.   :-1.000000   Min.   :   0.020   Min.   :    0.00  
 1st Qu.:-0.04000   1st Qu.: 0.000000   1st Qu.:   4.275   1st Qu.:    2.20  
 Median :-0.01963   Median : 0.000000   Median :   9.200   Median :   18.12  
 Mean   :-0.03690   Mean   :-0.000641   Mean   : 400.705   Mean   :  173.37  
 3rd Qu.:-0.00981   3rd Qu.: 0.000000   3rd Qu.: 182.911   3rd Qu.:   77.00  
 Max.   : 0.00000   Max.   : 0.000000   Max.   :9534.852   Max.   :12652.75  
 NA's   :1935       NA's   :50          NA's   :31         NA's   :3273      
 pl_bmasseerr2       pl_bmasselim        pl_bmassj         pl_bmassjerr1    
 Min.   :-6038.74   Min.   :-1.00000   Min.   :6.290e-05   Min.   : 0.0000  
 1st Qu.:  -69.92   1st Qu.: 0.00000   1st Qu.:1.343e-02   1st Qu.: 0.0069  
 Median :  -16.84   Median : 0.00000   Median :2.895e-02   Median : 0.0580  
 Mean   : -126.42   Mean   : 0.03229   Mean   :1.261e+00   Mean   : 0.5459  
 3rd Qu.:   -2.04   3rd Qu.: 0.00000   3rd Qu.:5.755e-01   3rd Qu.: 0.2410  
 Max.   :    0.00   Max.   : 1.00000   Max.   :3.000e+01   Max.   :39.8100  
 NA's   :3273       NA's   :31         NA's   :31          NA's   :3273     
 pl_bmassjerr2       pl_bmassjlim      pl_bmassprov        pl_orbeccen     
 Min.   :-19.0000   Min.   :-1.00000   Length:6286        Min.   :0.00000  
 1st Qu.: -0.2200   1st Qu.: 0.00000   Class :character   1st Qu.:0.00000  
 Median : -0.0530   Median : 0.00000   Mode  :character   Median :0.00000  
 Mean   : -0.3979   Mean   : 0.03229                      Mean   :0.07927  
 3rd Qu.: -0.0064   3rd Qu.: 0.00000                      3rd Qu.:0.09000  
 Max.   :  0.0000   Max.   : 1.00000                      Max.   :0.95000  
 NA's   :3273       NA's   :31                            NA's   :1049     
 pl_orbeccenerr1  pl_orbeccenerr2   pl_orbeccenlim        pl_insol       
 Min.   :0.0000   Min.   :-0.7030   Min.   :-1.00000   Min.   :    0.00  
 1st Qu.:0.0200   1st Qu.:-0.0788   1st Qu.: 0.00000   1st Qu.:   24.10  
 Median :0.0480   Median :-0.0400   Median : 0.00000   Median :   99.99  
 Mean   :0.0676   Mean   :-0.0548   Mean   : 0.05232   Mean   :  419.35  
 3rd Qu.:0.0900   3rd Qu.:-0.0170   3rd Qu.: 0.00000   3rd Qu.:  376.07  
 Max.   :0.5000   Max.   : 0.0000   Max.   : 1.00000   Max.   :44900.00  
 NA's   :4408     NA's   :4408      NA's   :1049       NA's   :1880      
  pl_insolerr1       pl_insolerr2        pl_insollim       pl_eqt      
 Min.   :   0.000   Min.   :-7200.000   Min.   :0      Min.   :  34.0  
 1st Qu.:   1.341   1st Qu.:  -22.008   1st Qu.:0      1st Qu.: 569.0  
 Median :   5.905   Median :   -5.225   Median :0      Median : 823.0  
 Mean   :  41.333   Mean   :  -32.019   Mean   :0      Mean   : 914.5  
 3rd Qu.:  26.389   3rd Qu.:   -1.253   3rd Qu.:0      3rd Qu.:1163.0  
 Max.   :8100.000   Max.   :    0.000   Max.   :0      Max.   :4050.0  
 NA's   :2644       NA's   :2644        NA's   :1880   NA's   :1601    
   pl_eqterr1        pl_eqterr2         pl_eqtlim          ttv_flag      
 Min.   :   0.00   Min.   :-1217.00   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:  11.25   1st Qu.:  -34.00   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :  20.00   Median :  -19.21   Median :0.00000   Median :0.00000  
 Mean   :  31.39   Mean   :  -30.84   Mean   :0.00064   Mean   :0.07811  
 3rd Qu.:  35.00   3rd Qu.:  -11.00   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1217.00   Max.   :   -0.32   Max.   :1.00000   Max.   :1.00000  
 NA's   :4459      NA's   :4459       NA's   :1601                       
 st_spectype           st_teff       st_tefferr1       st_tefferr2      
 Length:6286        Min.   :  415   Min.   :   1.00   Min.   :-2360.24  
 Class :character   1st Qu.: 4896   1st Qu.:  59.83   1st Qu.: -124.62  
 Mode  :character   Median : 5542   Median :  86.00   Median :  -86.62  
                    Mean   : 5393   Mean   :  98.50   Mean   :  -99.22  
                    3rd Qu.: 5897   3rd Qu.: 120.36   3rd Qu.:  -58.55  
                    Max.   :57000   Max.   :1763.10   Max.   :   -1.00  
                    NA's   :294     NA's   :607       NA's   :607       
   st_tefflim           st_rad          st_raderr1         st_raderr2       
 Min.   :0.000000   Min.   : 0.0115   Min.   :  0.0000   Min.   :-28.97000  
 1st Qu.:0.000000   1st Qu.: 0.7700   1st Qu.:  0.0230   1st Qu.: -0.10500  
 Median :0.000000   Median : 0.9500   Median :  0.0500   Median : -0.05000  
 Mean   :0.000167   Mean   : 1.4911   Mean   :  0.1876   Mean   : -0.14659  
 3rd Qu.:0.000000   3rd Qu.: 1.2400   3rd Qu.:  0.1640   3rd Qu.: -0.02279  
 Max.   :1.000000   Max.   :88.4750   Max.   :104.5280   Max.   :  0.00000  
 NA's   :294        NA's   :318       NA's   :590        NA's   :591        
   st_radlim      st_mass         st_masserr1        st_masserr2       
 Min.   :0     Min.   : 0.0094   Min.   : 0.00020   Min.   :-10.02000  
 1st Qu.:0     1st Qu.: 0.7700   1st Qu.: 0.02900   1st Qu.: -0.08000  
 Median :0     Median : 0.9400   Median : 0.04800   Median : -0.04900  
 Mean   :0     Mean   : 0.9355   Mean   : 0.08517   Mean   : -0.07891  
 3rd Qu.:0     3rd Qu.: 1.0830   3rd Qu.: 0.08900   3rd Qu.: -0.02900  
 Max.   :0     Max.   :10.9400   Max.   :10.02000   Max.   :  0.00000  
 NA's   :318   NA's   :9         NA's   :285        NA's   :286        
   st_masslim     st_met           st_meterr1        st_meterr2     
 Min.   :0    Min.   :-1.00000   Min.   :0.00100   Min.   :-1.0000  
 1st Qu.:0    1st Qu.:-0.08000   1st Qu.:0.04000   1st Qu.:-0.1500  
 Median :0    Median : 0.02000   Median :0.08120   Median :-0.0817  
 Mean   :0    Mean   : 0.01658   Mean   :0.09647   Mean   :-0.1022  
 3rd Qu.:0    3rd Qu.: 0.13000   3rd Qu.:0.14000   3rd Qu.:-0.0400  
 Max.   :0    Max.   : 0.79000   Max.   :0.96000   Max.   : 0.0000  
 NA's   :9    NA's   :640        NA's   :924       NA's   :924      
   st_metlim   st_metratio           st_logg       st_loggerr1     
 Min.   :-1    Length:6286        Min.   :0.541   Min.   :0.00000  
 1st Qu.: 0    Class :character   1st Qu.:4.294   1st Qu.:0.02900  
 Median : 0    Mode  :character   Median :4.458   Median :0.05500  
 Mean   : 0                       Mean   :4.386   Mean   :0.08246  
 3rd Qu.: 0                       3rd Qu.:4.580   3rd Qu.:0.10300  
 Max.   : 1                       Max.   :8.070   Max.   :1.10000  
 NA's   :640                      NA's   :322     NA's   :613      
  st_loggerr2         st_logglim          rastr                 ra          
 Min.   :-3.51000   Min.   :-1.00000   Length:6286        Min.   :  0.1856  
 1st Qu.:-0.15000   1st Qu.: 0.00000   Class :character   1st Qu.:169.4479  
 Median :-0.07200   Median : 0.00000   Mode  :character   Median :284.4335  
 Mean   :-0.09688   Mean   :-0.00101                      Mean   :231.2644  
 3rd Qu.:-0.03000   3rd Qu.: 0.00000                      3rd Qu.:293.1234  
 Max.   : 0.00000   Max.   : 0.00000                      Max.   :359.9750  
 NA's   :613        NA's   :322                                             
    decstr               dec            sy_dist          sy_disterr1       
 Length:6286        Min.   :-89.47   Min.   :   1.301   Min.   :3.400e-04  
 Class :character   1st Qu.:-12.05   1st Qu.: 101.730   1st Qu.:4.220e-01  
 Mode  :character   Median : 39.04   Median : 362.715   Median :3.781e+00  
                    Mean   : 17.71   Mean   : 704.443   Mean   :5.916e+01  
                    3rd Qu.: 45.41   3rd Qu.: 826.074   3rd Qu.:1.668e+01  
                    Max.   : 88.83   Max.   :8500.000   Max.   :2.600e+03  
                                     NA's   :27         NA's   :135        
  sy_disterr2            sy_vmag        sy_vmagerr1       sy_vmagerr2       
 Min.   :-2.840e+03   Min.   : 0.872   Min.   :0.00100   Min.   :-11.92000  
 1st Qu.:-1.621e+01   1st Qu.:10.700   1st Qu.:0.03000   1st Qu.: -0.12600  
 Median :-3.725e+00   Median :13.193   Median :0.06900   Median : -0.06900  
 Mean   :-6.636e+01   Mean   :12.541   Mean   :0.09716   Mean   : -0.09863  
 3rd Qu.:-4.200e-01   3rd Qu.:14.915   3rd Qu.:0.12600   3rd Qu.: -0.03000  
 Max.   :-3.500e-04   Max.   :44.610   Max.   :3.10000   Max.   : -0.00100  
 NA's   :135          NA's   :299      NA's   :307       NA's   :312        
    sy_kmag        sy_kmagerr1       sy_kmagerr2         sy_gaiamag    
 Min.   :-3.044   Min.   :0.01100   Min.   :-9.99500   Min.   : 2.364  
 1st Qu.: 8.406   1st Qu.:0.02000   1st Qu.:-0.03000   1st Qu.:10.424  
 Median :11.056   Median :0.02300   Median :-0.02300   Median :12.920  
 Mean   :10.372   Mean   :0.04096   Mean   :-0.04098   Mean   :12.248  
 3rd Qu.:12.706   3rd Qu.:0.03000   3rd Qu.:-0.02000   3rd Qu.:14.660  
 Max.   :33.110   Max.   :9.99500   Max.   :-0.01100   Max.   :20.186  
 NA's   :286      NA's   :323       NA's   :334        NA's   :343     
 sy_gaiamagerr1    sy_gaiamagerr2    
 Min.   :0.00011   Min.   :-0.06323  
 1st Qu.:0.00026   1st Qu.:-0.00054  
 Median :0.00036   Median :-0.00036  
 Mean   :0.00064   Mean   :-0.00064  
 3rd Qu.:0.00054   3rd Qu.:-0.00026  
 Max.   :0.06323   Max.   :-0.00011  
 NA's   :343       NA's   :343       

Dataset Variables

Exoplanet equilibrium temperature (pl_eqt) was used as the response variable for this study. Predictor variables were chosen according to their anticipated physical relationship with planetary temperature.

Selected predictor variables include:

  • Stellar effective temperature (st_teff)
  • Stellar mass (st_mass)
  • Stellar radius (st_rad)
  • Semi-major axis (pl_orbsmax)
  • Orbital period (pl_orbper)
  • Planet radius (pl_rade)
  • Planet mass (pl_bmasse)
  • Orbital eccentricity (pl_orbeccen)
Code
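# Keep the response and the eight selected predictors,
# then drop any row with a missing value in those columns
exo_clean <- exoplanets %>%
  select(
    pl_eqt,
    st_teff,
    st_mass,
    st_rad,
    pl_orbsmax,
    pl_orbper,
    pl_rade,
    pl_bmasse,
    pl_orbeccen
  ) %>%
  drop_na()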

dim(exo_clean)
[1] 4106    9
Code
summary(exo_clean)
     pl_eqt          st_teff         st_mass           st_rad      
 Min.   :  55.9   Min.   : 2566   Min.   :0.0898   Min.   :0.0131  
 1st Qu.: 568.0   1st Qu.: 5080   1st Qu.:0.8110   1st Qu.:0.7820  
 Median : 822.0   Median : 5618   Median :0.9500   Median :0.9495  
 Mean   : 911.9   Mean   : 5426   Mean   :0.9473   Mean   :1.0231  
 3rd Qu.:1166.0   3rd Qu.: 5946   3rd Qu.:1.0810   3rd Qu.:1.2007  
 Max.   :4050.0   Max.   :10170   Max.   :2.7800   Max.   :6.3000  
   pl_orbsmax          pl_orbper            pl_rade          pl_bmasse        
 Min.   : 0.005626   Min.   :1.800e-01   Min.   : 0.3098   Min.   :   0.0374  
 1st Qu.: 0.048000   1st Qu.:3.947e+00   1st Qu.: 1.6400   1st Qu.:   3.5000  
 Median : 0.079500   Median :8.759e+00   Median : 2.4900   Median :   7.2700  
 Mean   : 0.205912   Mean   :1.273e+02   Mean   : 4.5440   Mean   : 147.4653  
 3rd Qu.: 0.147295   3rd Qu.:2.122e+01   3rd Qu.: 4.3407   3rd Qu.:  26.6994  
 Max.   :63.000000   Max.   :1.170e+05   Max.   :25.0000   Max.   :8899.1954  
  pl_orbeccen     
 Min.   :0.00000  
 1st Qu.:0.00000  
 Median :0.00000  
 Mean   :0.04151  
 3rd Qu.:0.01300  
 Max.   :0.93183  

Exploratory Data Analysis

Code
colSums(is.na(exo_clean))

Missing data were checked again after preprocessing. No missing values remained in the selected variables of the final cleaned dataset.

Code
ggplot(exo_clean, aes(x = pl_eqt)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Exoplanet Equilibrium Temperature",
    x = "Equilibrium Temperature (K)",
    y = "Count"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Distribution of Exoplanet Equilibrium Temperature

The distribution of exoplanet equilibrium temperature is right-skewed. Most planets have equilibrium temperatures between roughly 400 K and 1200 K, while a small number of very hot planets produce a long right tail. These extreme values are potential outliers and suggest that some planets receive far more radiation from their host stars than others.
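One way to quantify the right skew described above, and to see how a log transformation would affect it, is to compute sample skewness before and after taking logs. The sketch below uses simulated lognormal values rather than the project's data, and the helper `skewness` is a hypothetical function written for this example; the report does not state that the response was transformed.

```r
# Illustrative sketch on simulated data (not the project's data).
set.seed(621)
sim_eqt <- rlnorm(1000, meanlog = log(800), sdlog = 0.5)  # right-skewed values in K

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(sim_eqt)
skew_log <- skewness(log(sim_eqt))

print(c(raw = skew_raw, log = skew_log))
```

A clearly positive value on the raw scale and a value near zero on the log scale would indicate that the long right tail is largely removed by the transformation.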

Code
cor_matrix <- cor(exo_clean)

corrplot(
  cor_matrix,
  method = "color",
  tl.cex = 0.7
)

Correlation Matrix of Selected Variables

Several notable relationships appear in the correlation matrix. Exoplanet equilibrium temperature (pl_eqt) is positively correlated with stellar temperature, stellar mass, stellar radius, and planetary radius: planets with larger values of these variables tend to have higher equilibrium temperatures.

In addition, there exist very high positive correlations between several stellar features, especially stellar temperature, stellar mass, and stellar radius. These relationships imply multicollinearity between predictor variables that might influence the stability of the coefficients in the multiple linear regression models. Thus, regularization methods such as ridge and lasso regression can be used later during model training.
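A common way to quantify the multicollinearity described above is the variance inflation factor (VIF), which for each predictor equals \(1 / (1 - R^2_j)\), where \(R^2_j\) comes from regressing that predictor on the others. The sketch below is a base R illustration on simulated data, not the project's data; the helper `vif_one` and the simulated relationships are assumptions made for the example.

```r
# Illustrative sketch on simulated data (not the project's data).
set.seed(621)
n <- 500
st_mass <- rnorm(n, mean = 1, sd = 0.2)
st_rad  <- st_mass + rnorm(n, sd = 0.05)  # built to track st_mass closely
pl_rade <- rnorm(n, mean = 4, sd = 2)     # unrelated to the stellar columns
predictors <- data.frame(st_mass, st_rad, pl_rade)

# VIF for one predictor: regress it on the others, then 1 / (1 - R^2)
vif_one <- function(name, data) {
  others <- setdiff(names(data), name)
  fit <- lm(reformulate(others, response = name), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vif_values <- sapply(names(predictors), vif_one, data = predictors)
print(round(vif_values, 2))
```

In this simulation the two correlated stellar columns receive large VIF values while the unrelated column stays close to 1. Values well above roughly 5 to 10 are often taken as a sign that coefficient estimates may be unstable.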

Finally, orbital distance (pl_orbsmax) and orbital period (pl_orbper) have a very high positive correlation. This is expected from Kepler's third law, which ties a planet's orbital period to its semi-major axis and the mass of its host star.
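The link between orbital period and semi-major axis can be checked directly with Kepler's third law, which in solar units reads \(P^2 = a^3 / M_{\star}\) with \(P\) in years, \(a\) in AU, and \(M_{\star}\) in solar masses (neglecting the planet's mass). The function name `kepler_period_days` below is hypothetical; the input values for 11 Com b are taken from the glimpse() output earlier in this report.

```r
# Illustrative sketch: Kepler's third law in solar units,
# P[years]^2 = a[AU]^3 / M[solar masses], ignoring the planet's mass.
kepler_period_days <- function(pl_orbsmax, st_mass) {
  sqrt(pl_orbsmax^3 / st_mass) * 365.25
}

# 11 Com b: semi-major axis 1.178 AU, stellar mass 2.09 solar masses
period_11comb <- kepler_period_days(pl_orbsmax = 1.178, st_mass = 2.09)
print(round(period_11comb, 1))
```

The result is close to the 323.21 days listed for 11 Com b in pl_orbper, which illustrates why the two columns carry largely overlapping information once stellar mass is known.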

Code
ggplot(exo_clean, aes(x = st_teff, y = pl_eqt)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Equilibrium Temperature vs Stellar Temperature",
    x = "Stellar Temperature (K)",
    y = "Equilibrium Temperature (K)"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Equilibrium Temperature vs Stellar Temperature

The scatterplot indicates a positive relationship between stellar temperature and exoplanet equilibrium temperature. Planets orbiting hotter stars generally exhibit higher equilibrium temperatures, which is consistent with the physical expectation that hotter stars emit greater amounts of stellar radiation.

The relationship also displays increasing variability at higher stellar temperatures, suggesting that additional planetary and orbital factors influence equilibrium temperature. Several extreme observations are also visible, indicating the presence of potentially influential outliers within the dataset.

Code
ggplot(exo_clean, aes(x = pl_orbsmax, y = pl_eqt)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  labs(
    title = "Equilibrium Temperature vs Orbital Distance",
    x = "Semi-Major Axis (AU, log scale)",
    y = "Equilibrium Temperature (K)"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Equilibrium Temperature vs Orbital Distance

The log-scale scatter plot shows a clear negative relationship between orbital distance and exoplanet equilibrium temperature. Planets that orbit closer to their stars have considerably higher equilibrium temperatures because they receive more stellar energy. As orbital distance increases, equilibrium temperature decreases.

Because orbital distances cover a very wide range, the semi-major axis is plotted on a logarithmic scale. The curved trend in the scatterplot suggests that orbital distance will be an important predictor in the regression models, and that its effect on equilibrium temperature is not linear on the original scale.
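
The shape of this curve matches the standard blackbody approximation for equilibrium temperature. The sketch below is an illustration under stated assumptions, not the method used to compute pl_eqt in the archive: the function `eq_temp()`, the zero-albedo default, and the assumption of full heat redistribution are all introduced here.

```r
# Equilibrium temperature under the standard blackbody approximation:
#   T_eq = T_star * sqrt(R_star / (2 * a)) * (1 - albedo)^(1/4)
# Assumes full heat redistribution; albedo = 0 is an illustrative default.
R_SUN_IN_AU <- 6.957e8 / 1.495978707e11  # solar radius expressed in AU

eq_temp <- function(st_teff, st_rad, pl_orbsmax, albedo = 0) {
  # st_teff in K, st_rad in solar radii, pl_orbsmax in AU
  st_teff * sqrt(st_rad * R_SUN_IN_AU / (2 * pl_orbsmax)) * (1 - albedo)^0.25
}

# Sun-like star (5772 K, 1 solar radius) with planets at 0.05, 1, and 5 AU.
temps <- eq_temp(st_teff = 5772, st_rad = 1, pl_orbsmax = c(0.05, 1, 5))
print(round(temps))
```

Under this approximation the temperature scales with the inverse square root of orbital distance, so quadrupling the distance halves the equilibrium temperature. A straight line in the raw semi-major axis cannot represent that behavior.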

Regression Modeling

Code
set.seed(621)

train_index <- createDataPartition(
  exo_clean$pl_eqt,
  p = 0.8,
  list = FALSE
)

train_data <- exo_clean[train_index, ]
test_data  <- exo_clean[-train_index, ]

The data were split into a training set and a testing set using an 80/20 split. The training set was used to fit the models, while the testing set was held out and used only to evaluate them.
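
For a numeric outcome, `createDataPartition()` samples within groups based on percentiles of the outcome, so that the training and testing sets have similar distributions of the target. The base-R sketch below imitates that idea on simulated data; the data frame `toy` and the quintile binning are assumptions made for illustration and do not reproduce caret's internals exactly.

```r
set.seed(621)
# Simulated stand-in for exo_clean$pl_eqt (NOT the archive data).
toy <- data.frame(pl_eqt = rlnorm(1000, meanlog = 6.8, sdlog = 0.5))

# Bin the outcome into quintiles, then sample 80% of the rows in each bin.
bins <- cut(
  toy$pl_eqt,
  breaks = quantile(toy$pl_eqt, probs = seq(0, 1, by = 0.2)),
  include.lowest = TRUE
)
train_rows <- unlist(lapply(split(seq_len(nrow(toy)), bins), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

toy_train <- toy[train_rows, , drop = FALSE]
toy_test  <- toy[-train_rows, , drop = FALSE]

print(c(train = nrow(toy_train), test = nrow(toy_test)))
```

Comparing `summary(toy_train$pl_eqt)` with `summary(toy_test$pl_eqt)` is a quick check that the two sets cover a similar range of temperatures.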

Code
lm_model <- lm(
  pl_eqt ~ st_teff +
    st_mass +
    st_rad +
    pl_orbsmax +
    pl_orbper +
    pl_rade +
    pl_bmasse +
    pl_orbeccen,
  data = train_data
)

broom::tidy(lm_model) %>%
  knitr::kable(
    digits = 3,
    caption = "Multiple Linear Regression Coefficients"
  )
Multiple Linear Regression Coefficients

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 45.591 | 54.670 | 0.834 | 0.404 |
| st_teff | 0.053 | 0.017 | 3.082 | 0.002 |
| st_mass | 426.962 | 64.943 | 6.574 | 0.000 |
| st_rad | 90.816 | 24.327 | 3.733 | 0.000 |
| pl_orbsmax | -254.960 | 15.275 | -16.691 | 0.000 |
| pl_orbper | 0.138 | 0.009 | 15.435 | 0.000 |
| pl_rade | 27.796 | 1.622 | 17.133 | 0.000 |
| pl_bmasse | 0.044 | 0.012 | 3.652 | 0.000 |
| pl_orbeccen | -446.508 | 59.683 | -7.481 | 0.000 |

The multiple linear regression identified a set of statistically significant predictors of exoplanet equilibrium temperature. Stellar temperature, stellar mass, and stellar radius all had positive coefficients, meaning that, holding the other predictors constant, equilibrium temperature was higher for planets orbiting hotter, more massive, and larger stars.

Orbital distance (pl_orbsmax) had a strongly negative coefficient, consistent with more distant planets receiving less radiation from their host stars and therefore having lower temperatures. Orbital eccentricity also had a negative coefficient. The coefficient for orbital period (pl_orbper) was positive; because orbital period and orbital distance are highly correlated, the two coefficients should be interpreted jointly rather than individually.

Planetary radius and planetary mass both had positive coefficients, implying that, within this sample, larger planets tended to have higher equilibrium temperatures.

All eight predictors were statistically significant at the 0.05 level; only the intercept was not (p = 0.404).

Code
lm_preds <- predict(
  lm_model,
  newdata = test_data
)
Code
rmse_lm <- RMSE(lm_preds, test_data$pl_eqt)

mae_lm <- MAE(lm_preds, test_data$pl_eqt)

r2_lm <- R2(lm_preds, test_data$pl_eqt)

metrics_lm <- data.frame(
  Model = "Linear Regression",
  RMSE = rmse_lm,
  MAE = mae_lm,
  `$R^2$` = r2_lm,
  check.names = FALSE
)

knitr::kable(
  metrics_lm,
  digits = 3,
  caption = "Linear Regression Performance Metrics"
)
Linear Regression Performance Metrics

| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Linear Regression | 350.006 | 259.251 | 0.427 |

According to the multiple linear regression model, the testing dataset results are moderate in terms of prediction accuracy, with an \(R^2\) score of approximately 0.427, meaning the regression model accounts for about 43% of the variation in the equilibrium temperature of the exoplanets.

The Root Mean Squared Error (RMSE) of about 350 K and the Mean Absolute Error (MAE) of about 259 K show that sizable prediction errors remain. Although the linear regression model captures several of the relationships in the data, factors beyond those represented in the linear model may also contribute to equilibrium temperature.
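
The three metrics can be written out in a few lines of base R. One detail worth keeping in mind is that caret's `R2()` defaults to `formula = "corr"`, the squared correlation between predictions and observations, which can differ from the traditional \(1 - SSE/SST\) definition. The helper names below are hypothetical, and the toy vectors are chosen only to make the difference visible.

```r
# Hand-written versions of the evaluation metrics (helper names are illustrative).
rmse    <- function(pred, obs) sqrt(mean((obs - pred)^2))
mae     <- function(pred, obs) mean(abs(obs - pred))
r2_corr <- function(pred, obs) cor(pred, obs)^2   # what caret::R2() computes by default
r2_trad <- function(pred, obs) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Toy example: predictions are perfectly correlated with the truth
# but systematically too small by a factor of two.
obs  <- c(2, 4, 6)
pred <- c(1, 2, 3)

print(c(
  RMSE    = rmse(pred, obs),
  MAE     = mae(pred, obs),
  R2_corr = r2_corr(pred, obs),   # 1: perfect correlation
  R2_trad = r2_trad(pred, obs)    # -0.75: worse than predicting the mean
))
```

The squared-correlation version rewards predictions that rank the observations correctly even when they are biased, so it is best read alongside RMSE and MAE rather than on its own.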

Code
plot_df <- data.frame(
  Actual = test_data$pl_eqt,
  Predicted = lm_preds
)

ggplot(plot_df, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.5) +
  geom_abline(
    slope = 1,
    intercept = 0,
    color = "red"
  ) +
  labs(
    title = "Predicted vs Actual Equilibrium Temperature",
    x = "Actual Equilibrium Temperature",
    y = "Predicted Equilibrium Temperature"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Predicted vs Actual Equilibrium Temperature

The plot of predicted against actual values shows that the multiple linear regression model captures the overall trend in equilibrium temperature. Many observations lie near the reference line, although the scatter around it is substantial, in line with the moderate test-set \(R^2\).

The spread around the reference line widens as equilibrium temperature rises, showing that the model has difficulty predicting planets with more extreme temperatures. Some observations also lie far from the reference line, which points to possible outliers in the data.

Code
residual_df <- data.frame(
  Predicted = lm_preds,
  Residuals = test_data$pl_eqt - lm_preds
)

ggplot(residual_df, aes(x = Predicted, y = Residuals)) +
  geom_point(alpha = 0.5) +
  geom_hline(
    yintercept = 0,
    color = "red"
  ) +
  labs(
    title = "Residual Plot",
    x = "Predicted Values",
    y = "Residuals"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Residual Plot for Multiple Linear Regression

The residual plot shows residuals scattered around zero, indicating that the linear regression captures the main trend in the data and is not systematically biased across most of the range of predicted values.

However, the variance of the residuals increases at higher predicted equilibrium temperatures, which suggests heteroscedasticity and possibly nonlinear relationships that the model does not capture. There are also several large residuals, meaning that some planetary systems are poorly described by the linear regression.
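
A visual impression of heteroscedasticity can be checked with a formal test. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand on simulated data whose noise grows with the predictor by construction; the variable names and the simulated model are assumptions for illustration, not part of the original analysis.

```r
set.seed(621)
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = x)   # noise sd grows with x by construction

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictors.
sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution (df = number of predictors).
bp_stat   <- n * summary(aux)$r.squared
bp_pvalue <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_pvalue))
```

A small p-value indicates that the residual variance depends on the predictors, in which case the usual standard errors from `lm()` should be treated with caution.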

Advanced Regression Models

Code
x_train <- model.matrix(
  pl_eqt ~ .,
  train_data
)[, -1]

y_train <- train_data$pl_eqt

x_test <- model.matrix(
  pl_eqt ~ .,
  test_data
)[, -1]

y_test <- test_data$pl_eqt
Code
ridge_model <- cv.glmnet(
  x_train,
  y_train,
  alpha = 0
)
Code
ridge_preds <- predict(
  ridge_model,
  s = ridge_model$lambda.min,
  newx = x_test
)
Code
ridge_preds <- as.numeric(ridge_preds)

rmse_ridge <- RMSE(ridge_preds, y_test)

mae_ridge <- MAE(ridge_preds, y_test)

r2_ridge <- R2(ridge_preds, y_test)

ridge_metrics <- data.frame(
  Model = "Ridge Regression",
  RMSE = rmse_ridge,
  MAE = mae_ridge,
  `$R^2$` = r2_ridge,
  check.names = FALSE
)

knitr::kable(
  ridge_metrics,
  digits = 3,
  caption = "Ridge Regression Performance Metrics"
)
Ridge Regression Performance Metrics

| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Ridge Regression | 360.803 | 268.428 | 0.388 |

The ridge regression model exhibited predictive accuracy similar to the standard multiple linear regression model, although the model had slightly less predictive accuracy in terms of RMSE, MAE, and \(R^2\) values. This indicates that even though there is multicollinearity between several variables, the technique of shrinking the coefficients did not have much effect on the performance of the model.

Even though the predictive accuracy was slightly reduced, ridge regression is still helpful in stabilizing coefficients.
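
The stabilizing effect of ridge regression can be seen directly from its closed-form solution. The following base-R sketch uses simulated, standardized predictors with two nearly collinear columns; the helper `ridge_coef()` is hypothetical, and its penalty is not on the same scale as glmnet's `lambda`, so the values are illustrative only.

```r
set.seed(621)
n  <- 200
z1 <- rnorm(n)
# x1 and x2 are nearly collinear; x3 is independent (simulated data).
X <- scale(cbind(x1 = z1, x2 = z1 + rnorm(n, sd = 0.1), x3 = rnorm(n)))
y <- as.numeric(X %*% c(2, 2, -1) + rnorm(n))
y <- y - mean(y)   # centered response, so no intercept is needed

# Closed-form ridge estimate: (X'X + lambda * I)^{-1} X'y
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

lambdas <- c(0, 1, 10, 100)
coefs <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(coefs) <- paste0("lambda=", lambdas)

# Euclidean norm of the coefficient vector for each penalty.
norms <- sqrt(colSums(coefs^2))
print(round(coefs, 3))
print(round(norms, 3))
```

As the penalty grows, the overall size of the coefficient vector shrinks and the estimates for the two collinear columns are pulled toward each other, which is the stabilization referred to above.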

Code
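# Lasso fit (alpha = 1); the fitting chunk is not shown, so this call is
# assumed to mirror the ridge fit above.
lasso_model <- cv.glmnet(x_train, y_train, alpha = 1)

lasso_preds <- predict(
  lasso_model,
  s = lasso_model$lambda.min,
  newx = x_test
)

lasso_preds <- as.numeric(lasso_preds)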
Code
rmse_lasso <- RMSE(lasso_preds, y_test)

mae_lasso <- MAE(lasso_preds, y_test)

r2_lasso <- R2(lasso_preds, y_test)

lasso_metrics <- data.frame(
  Model = "Lasso Regression",
  RMSE = rmse_lasso,
  MAE = mae_lasso,
  `$R^2$` = r2_lasso,
  check.names = FALSE
)

knitr::kable(
  lasso_metrics,
  digits = 3,
  caption = "Lasso Regression Performance Metrics"
)
Lasso Regression Performance Metrics

| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Lasso Regression | 365.669 | 272.591 | 0.372 |

Lasso regression performed slightly worse than both the baseline linear regression and the ridge regression. This suggests that shrinking the coefficients toward zero discarded some information that was useful for prediction.

Despite its weaker predictive performance, lasso regression still offers insight into how the predictors relate to the dependent variable. Several predictors retain large coefficients, including stellar mass, stellar radius, orbital distance, and planetary radius.

Code
lasso_coef <- as.matrix(
  coef(lasso_model, s = "lambda.min")
)

lasso_coef_df <- data.frame(
  Variable = rownames(lasso_coef),
  Coefficient = lasso_coef[,1],
  row.names = NULL
)

knitr::kable(
  lasso_coef_df,
  digits = 3,
  caption = "Lasso Regression Coefficients"
)
Lasso Regression Coefficients

| Variable | Coefficient |
|---|---|
| (Intercept) | 116.587 |
| st_teff | 0.030 |
| st_mass | 485.170 |
| st_rad | 73.219 |
| pl_orbsmax | -61.009 |
| pl_orbper | 0.025 |
| pl_rade | 26.849 |
| pl_bmasse | 0.005 |
| pl_orbeccen | -399.828 |

The lasso regression model shrinks coefficients to reduce model complexity and can perform automatic feature selection by setting coefficients exactly to zero. Several predictors retained sizable coefficients, including stellar mass, stellar radius, orbital distance, planetary radius, and orbital eccentricity, so these variables still contribute to the prediction of equilibrium temperature.

The coefficients for orbital period (pl_orbper) and planetary mass (pl_bmasse) were shrunk substantially relative to the linear regression estimates (0.025 versus 0.138, and 0.005 versus 0.044). This implies that these predictors added little predictive power once the other predictors were accounted for. Because the predictors are measured in different units, coefficient magnitudes are not directly comparable across variables.

At the selected penalty (lambda.min), no coefficient was shrunk exactly to zero, so the lasso retained all eight predictors. Its value here lies in indicating which predictors carry less weight rather than in removing any of them.
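
The mechanism by which the lasso can set coefficients exactly to zero is the soft-thresholding operator, which appears in the coordinate-descent updates used to fit lasso models. A minimal sketch follows; the function name `soft_threshold()` and the example values are illustrative.

```r
# Soft-thresholding: shrink each value toward zero by gamma,
# and set it to exactly zero if its magnitude is below gamma.
soft_threshold <- function(z, gamma) {
  sign(z) * pmax(abs(z) - gamma, 0)
}

# Illustrative unpenalized estimates and a penalty of 1.
unpenalized <- c(-3, 0.5, 2)
penalized   <- soft_threshold(unpenalized, gamma = 1)

print(penalized)            # -2  0  1
print(sum(penalized != 0))  # number of predictors kept
```

With a small penalty every estimate survives the threshold, which is one way a lasso fit can shrink all coefficients without eliminating any of them.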

Code
set.seed(621)

rf_model <- randomForest(
  pl_eqt ~ .,
  data = train_data,
  importance = TRUE
)
Code
rf_preds <- predict(
  rf_model,
  newdata = test_data
)
Code
rmse_rf <- RMSE(rf_preds, y_test)

mae_rf <- MAE(rf_preds, y_test)

r2_rf <- R2(rf_preds, y_test)

rf_metrics <- data.frame(
  Model = "Random Forest",
  RMSE = rmse_rf,
  MAE = mae_rf,
  R2 = r2_rf
)

colnames(rf_metrics)[4] <- "$R^2$"

knitr::kable(
  rf_metrics,
  digits = 3,
  caption = "Random Forest Performance Metrics"
)
Random Forest Performance Metrics

| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Random Forest | 109.787 | 62.081 | 0.945 |

Compared with the previous regression models, the random forest model was superior on every evaluation metric. It achieved an \(R^2\) of about 0.945, meaning that roughly 94.5% of the variance in equilibrium temperature is captured by the model.

The large gap between the random forest and the linear models suggests that non-linear dependencies and interactions among stellar, orbital, and planetary properties have a major effect on equilibrium temperature. Linear regression, by contrast, cannot capture these relationships unless they are specified explicitly.
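
To illustrate why the functional form matters, the sketch below simulates temperatures from the blackbody approximation \(T_{eq} = T_\star \sqrt{R_\star / (2a)}\) and fits a linear model on the raw scale and on the log scale. The data are simulated, the physical formula is an assumption introduced here, and the example is not a re-analysis of the archive data.

```r
set.seed(621)
n <- 400
# Simulated systems (NOT the archive data); column names mirror the dataset.
sim <- data.frame(
  st_teff    = runif(n, 3000, 7000),   # K
  st_rad     = runif(n, 0.3, 2),       # solar radii
  pl_orbsmax = 10^runif(n, -2, 1)      # AU, spread evenly on a log scale
)
r_sun_in_au <- 6.957e8 / 1.495978707e11

# Blackbody equilibrium temperature with 2% multiplicative noise.
sim$pl_eqt <- with(sim, st_teff * sqrt(st_rad * r_sun_in_au / (2 * pl_orbsmax))) *
  exp(rnorm(n, sd = 0.02))

raw_fit <- lm(pl_eqt ~ st_teff + st_rad + pl_orbsmax, data = sim)
log_fit <- lm(log(pl_eqt) ~ log(st_teff) + log(st_rad) + log(pl_orbsmax),
              data = sim)

# R^2 values are on different response scales, so compare them loosely.
print(c(raw = summary(raw_fit)$r.squared, log = summary(log_fit)$r.squared))
# The log-scale slopes should be close to the exponents 1, 0.5, and -0.5.
print(round(coef(log_fit), 2))
```

When the underlying relationship is multiplicative, a linear model on log-transformed variables can recover it, whereas the raw-scale linear model cannot. A random forest does not need the transformation to be specified in advance.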

This result demonstrates the effectiveness of using machine learning models in such cases.

Code
varImpPlot(
  rf_model,
  main = "Random Forest Variable Importance"
)

Random Forest Variable Importance

From the random forest variable importance analysis, pl_orbper and pl_orbsmax, indicating orbital period and orbital distance, respectively, are the two most important variables for predicting equilibrium temperature of the exoplanets. The properties of the stars, such as stellar radius, stellar temperature, and stellar mass, are also highly relevant.

This finding aligns well with the general expectations from astrophysics, as the equilibrium temperature of planets depends on the properties of the parent star and the orbital parameters of the planets. Planetary mass and orbital eccentricity were relatively insignificant compared to other variables in the random forest model.

The variable importance ranking is consistent with our conclusion that orbital and stellar properties, and the nonlinear relationships among them, are central to predicting equilibrium temperature.
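
When `importance = TRUE` is set, one of the importance measures reported by randomForest is based on permutation: a predictor's values are shuffled and the resulting increase in prediction error is recorded. The base-R sketch below applies the same idea to a simple linear model on simulated data. It is an illustration only; randomForest computes its version on out-of-bag samples, and all names here are hypothetical.

```r
set.seed(621)
n <- 500
# Simulated data: x1 and x2 drive the response, `noise` does not.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 3 * toy$x1 + toy$x2^2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x1 + I(x2^2) + noise, data = toy)

mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}
baseline <- mse(fit, toy)

# Permutation importance: increase in MSE after shuffling one column.
perm_importance <- sapply(c("x1", "x2", "noise"), function(v) {
  shuffled <- toy
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(fit, shuffled) - baseline
})

print(round(perm_importance, 3))
```

Importance scores of this kind measure how much the model relies on a variable, not the direction or shape of its effect, and correlated predictors such as orbital period and orbital distance can share or trade importance between them.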

Code
comparison_table <- data.frame(
  Model = c(
    "Linear Regression",
    "Ridge Regression",
    "Lasso Regression",
    "Random Forest"
  ),
  RMSE = c(
    rmse_lm,
    rmse_ridge,
    rmse_lasso,
    rmse_rf
  ),
  MAE = c(
    mae_lm,
    mae_ridge,
    mae_lasso,
    mae_rf
  ),
  R2 = c(
    r2_lm,
    r2_ridge,
    r2_lasso,
    r2_rf
  )
)

colnames(comparison_table)[4] <- "$R^2$"

knitr::kable(
  comparison_table,
  digits = 3,
  caption = "Model Performance Comparison"
)
Model Performance Comparison

| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Linear Regression | 350.006 | 259.251 | 0.427 |
| Ridge Regression | 360.803 | 268.428 | 0.388 |
| Lasso Regression | 365.669 | 272.591 | 0.372 |
| Random Forest | 109.787 | 62.081 | 0.945 |

The comparison shows considerable differences in performance across the regression models. The multiple linear regression model had moderate predictive power, while the ridge and lasso models performed slightly worse, with higher RMSE and MAE and lower \(R^2\).

Of all the tested models, the random forest performed far better than the others. It produced the smallest prediction errors and the largest \(R^2\), about 0.945.

Thus, the results demonstrate that non-linear dependencies between the parameters of exoplanets, their stars, and orbits significantly influence the equilibrium temperature of planets. Although the linear and regularization methods detected certain associations between exoplanet parameters, machine learning methods appear to perform better in complex astrophysical problems.

Conclusion

The current research explored the ability of regression and machine learning methods to make accurate predictions of equilibrium temperatures for exoplanets based on various features related to their stellar and orbital characteristics, which were collected from the NASA Exoplanet Archive website. An exploratory analysis found statistically significant associations between the target feature and some of the predictors, such as the temperature of the star, its mass, the distance of the orbit, and its period.

Four models, namely multiple linear, ridge, lasso, and random forest regression, were fitted and evaluated using RMSE, MAE, and \(R^2\). The multiple linear regression model showed moderate predictive accuracy (\(R^2\) of about 0.43), implying that several linear relationships exist between the predictors and the target variable but leave much of the variation unexplained. The regularized regression models were no more accurate than the baseline linear regression model.

The random forest regression model proved to be the most effective among all methods, clearly outperforming the other models on every evaluation metric. This model captured about 94.5% of the variance in equilibrium temperature, suggesting that non-linear dependencies between the predictors and the target variable are important.

The results from this study reveal the efficiency of using machine learning algorithms in analyzing astrophysical datasets. This study also highlights the role of stellar radiation and the arrangement of orbits in the environment of an exoplanet.

This study also has several limitations. The dataset contains observational errors and possible outliers, and the variables do not cover all the physical factors that affect a planet's equilibrium temperature. Future studies could test other machine learning algorithms or include additional features to improve model accuracy.

In summary, this study illustrates the applicability of statistical methods and machine learning techniques on astrophysical datasets.

References

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. Springer.

Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.

NASA Exoplanet Archive. Planetary Systems Composite Parameters Table. California Institute of Technology. Retrieved from https://exoplanetarchive.ipac.caltech.edu/

Appendix

The following appendix contains selected R code used for data preparation, regression modeling, and machine learning analysis.

Data Preparation Code

Code
library(tidyverse)
library(caret)
library(glmnet)
library(randomForest)
library(broom)
library(corrplot)

exoplanets <- read_csv(
  "exoplanets.csv",
  comment = "#"
)

exo_clean <- exoplanets %>%
  select(
    pl_eqt,
    st_teff,
    st_mass,
    st_rad,
    pl_orbsmax,
    pl_orbper,
    pl_rade,
    pl_bmasse,
    pl_orbeccen
  ) %>%
  drop_na()

Train/Test Split

Code
set.seed(621)

train_index <- createDataPartition(
  exo_clean$pl_eqt,
  p = 0.8,
  list = FALSE
)

train_data <- exo_clean[train_index, ]
test_data  <- exo_clean[-train_index, ]

Multiple Linear Regression Code

Code
lm_model <- lm(
  pl_eqt ~ st_teff +
    st_mass +
    st_rad +
    pl_orbsmax +
    pl_orbper +
    pl_rade +
    pl_bmasse +
    pl_orbeccen,
  data = train_data
)

summary(lm_model)

Random Forest Regression Code

Code
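# Fix the seed so the forest is reproducible, as in the main analysis.
set.seed(621)

rf_model <- randomForest(
  pl_eqt ~ .,
  data = train_data,
  importance = TRUE
)

rf_preds <- predict(
  rf_model,
  newdata = test_data
)

# Observed test-set values; defined here so the block runs on its own.
y_test <- test_data$pl_eqt

rmse_rf <- RMSE(rf_preds, y_test)

mae_rf <- MAE(rf_preds, y_test)

r2_rf <- R2(rf_preds, y_test)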