library(tidyverse)
library(caret)
library(glmnet)
library(randomForest)
library(broom)
library(corrplot)
This project investigates whether planetary equilibrium temperature can be accurately predicted from the properties of host stars, orbits, and the planets themselves. Using the NASA Exoplanet Archive, it applies regression modeling to relate equilibrium temperature to stellar temperature, stellar mass, stellar radius, orbital distance, planetary radius, and planetary mass. The study follows a full predictive modeling workflow: data cleaning, exploratory data analysis, feature selection, model construction, and model assessment. The models considered are multiple linear regression, regularized regression, and random forest regression, compared using RMSE, MAE, and \(R^2\).
Exoplanets; Regression Modeling; Equilibrium Temperature; NASA Exoplanet Archive; Predictive Analytics
An exoplanet is a planet that orbits a star beyond our Solar System. Many exoplanets have been discovered since the first detections were confirmed, using techniques such as transit photometry, radial velocity observations, direct imaging, and microlensing. These discoveries have expanded our knowledge of planetary systems, but many questions about their formation, orbits, and habitability remain unanswered.
One quantity used to characterize planets is equilibrium temperature: an estimate of the temperature a planet would have if it were heated only by the radiation from its host star and re-emitted that energy as a blackbody. Although it does not give the actual surface temperature, it is useful for identifying planets whose temperatures fall in a range that may be considered habitable.
The goal of this research is to construct a model that predicts exoplanet equilibrium temperature from measurable stellar, orbital, and planetary variables. The problem suits regression modeling because the outcome variable is numeric and the NASA Exoplanet Archive provides many candidate predictors.
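The definition above can be made concrete with the standard blackbody expression \(T_{eq} = T_\star \sqrt{R_\star / (2a)}\,(1 - A)^{1/4}\). The sketch below is only an illustration: the helper `eq_temp()`, the default albedo of 0.3, and the assumptions of a circular orbit and uniform heat redistribution are not taken from this report, and the archive's own `pl_eqt` values may have been derived under different assumptions.

```r
# Equilibrium temperature under simplifying assumptions (hypothetical helper):
# circular orbit, uniform heat redistribution, fixed Bond albedo.
R_SUN_M <- 6.957e8         # nominal solar radius in metres
AU_M    <- 1.495978707e11  # astronomical unit in metres

eq_temp <- function(st_teff, st_rad, pl_orbsmax, albedo = 0.3) {
  # st_teff in K, st_rad in solar radii, pl_orbsmax in AU
  ratio <- (st_rad * R_SUN_M) / (2 * pl_orbsmax * AU_M)
  st_teff * sqrt(ratio) * (1 - albedo)^0.25
}

# Earth-like check: Sun-like star, 1 AU, albedo 0.3 (about 255 K)
eq_temp(st_teff = 5772, st_rad = 1, pl_orbsmax = 1)
```

The expression already shows which predictors should matter most: stellar temperature enters linearly, while stellar radius and orbital distance enter through a square root.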
Applications of machine learning and regression analysis are significant for astrophysics research involving exoplanet and planetary system analysis. For example, machine learning and statistical methods have been used to predict certain parameters of planets, classify exoplanets, and identify habitable planets based on observations obtained by astronomical surveys (James et al., 2021).
Regression analysis is commonly used to relate the characteristics of stars and planets. It is effective for modeling linear relationships and for testing the significance of individual predictors. However, some astrophysical systems feature nonlinear relationships that require other approaches (Kuhn & Johnson, 2013).
In more recent times, researchers have attempted machine learning approaches like random forests, artificial neural networks, and support vector machines to analyze exoplanets. Such methods are effective at recognizing complicated nonlinear connections between variables, and usually perform better than conventional regression approaches when analyzing observational data sets. The random forest algorithm is especially useful since it not only identifies nonlinear relationships but also assesses the relative importance of each predictor (Hastie, Tibshirani, & Friedman, 2009).
The current research contributes to the literature by applying both regression and machine learning algorithms to model the continuous equilibrium temperature of an exoplanet. The research differs from other studies which mainly concentrate on classification algorithms for modeling exoplanets. In addition, the study tries to determine whether there are any benefits to be gained by utilizing a nonlinear approach in machine learning algorithms.
Which stellar, orbital, and planetary characteristics are most important for predicting exoplanet equilibrium temperature?
In this analysis, data on confirmed exoplanets is used, sourced from the NASA Exoplanet Archive. This analysis seeks to predict the equilibrium temperature of an exoplanet based on certain predictor variables that characterize the planet and its star. Regression modeling techniques were used to analyze the relationship between equilibrium temperature and the selected predictor variables.
The data was extracted from the NASA Exoplanet Archive Planetary Systems Composite Parameters table, which contains data on confirmed exoplanet observations and their stellar system characteristics.
Data preparation and cleaning were performed in R. Predictor variables with too many missing values were excluded, and observations with missing values in any of the remaining variables were removed from the modeling dataset. Variable distributions and correlations were then examined through exploratory data analysis.
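The cleaning rule described above can be sketched on a toy table. The data frame `toy`, its values, and the 50% missingness threshold are all made up for illustration; the report does not state the threshold that was actually used.

```r
# Toy data frame standing in for the archive table (values are made up)
toy <- data.frame(
  pl_eqt   = c(1700, NA, 1450, 900, NA, 600),
  st_teff  = c(4060, 5338, 5750, NA, 4980, 5100),
  pl_insol = c(NA, NA, NA, NA, 120, NA)
)

# Share of missing values per column
na_share <- colMeans(is.na(toy))

# Drop columns above a hypothetical 50% threshold, then drop incomplete rows
keep      <- names(na_share)[na_share <= 0.5]
toy_clean <- na.omit(toy[, keep, drop = FALSE])

na_share
dim(toy_clean)
```

Dropping sparse columns first matters: removing incomplete rows while `pl_insol` is still present would leave this toy table with no rows at all.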
exoplanets <- read_csv(
"exoplanets.csv",
comment = "#"
)
glimpse(exoplanets)
Rows: 6,286
Columns: 84
$ pl_name <chr> "11 Com b", "11 UMi b", "14 And b", "14 Her b", "16 Cy…
$ hostname <chr> "11 Com", "11 UMi", "14 And", "14 Her", "16 Cyg B", "1…
$ sy_snum <dbl> 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, …
$ sy_pnum <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, …
$ discoverymethod <chr> "Radial Velocity", "Radial Velocity", "Radial Velocity…
$ disc_year <dbl> 2007, 2009, 2008, 2002, 1996, 2020, 2008, 2008, 2018, …
$ disc_facility <chr> "Xinglong Station", "Thueringer Landessternwarte Taute…
$ pl_controv_flag <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pl_orbper <dbl> 323.21, 516.22, 186.76, 1766.41, 798.50, 578.38, 982.8…
$ pl_orbpererr1 <dbl> 0.06, 3.20, 0.11, 0.67, 1.00, 2.01, 1.06, NA, 0.00, 2.…
$ pl_orbpererr2 <dbl> -0.05, -3.20, -0.12, -0.68, -1.00, -2.09, -0.92, NA, -…
$ pl_orbperlim <dbl> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, NA, NA, NA, NA, NA, …
$ pl_orbsmax <dbl> 1.178, 1.530, 0.775, 2.839, 1.660, 1.450, 2.476, 330.0…
$ pl_orbsmaxerr1 <dbl> 0.000, 0.070, 0.000, 0.039, 0.030, 0.020, 0.002, NA, 0…
$ pl_orbsmaxerr2 <dbl> 0.000, -0.070, 0.000, -0.041, -0.030, -0.020, -0.002, …
$ pl_orbsmaxlim <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pl_rade <dbl> 12.20000, 12.30000, 13.10000, 12.50000, 13.50000, 12.9…
$ pl_radeerr1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_radeerr2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_radelim <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pl_radj <dbl> 1.090, 1.090, 1.160, 1.120, 1.200, 1.150, 1.120, 1.664…
$ pl_radjerr1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_radjerr2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_radjlim <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pl_bmasse <dbl> 4914.8985, 4684.8142, 1131.1513, 2828.6728, 565.7374, …
$ pl_bmasseerr1 <dbl> 39.09289, 794.57500, 36.23244, 413.17693, 25.42640, 47…
$ pl_bmasseerr2 <dbl> -39.72855, -794.57500, -38.77507, -540.30829, -25.4264…
$ pl_bmasselim <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pl_bmassj <dbl> 15.464, 14.740, 3.559, 8.900, 1.780, 4.320, 9.207, 8.0…
$ pl_bmassjerr1 <dbl> 0.123, 2.500, 0.114, 1.300, 0.080, 0.150, 0.160, 1.000…
$ pl_bmassjerr2 <dbl> -0.125, -2.500, -0.122, -1.700, -0.080, -0.120, -0.077…
$ pl_bmassjlim <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pl_bmassprov <chr> "Msini", "Msini", "Msini", "Mass", "Msini", "Msini", "…
$ pl_orbeccen <dbl> 0.2380, 0.0800, 0.0000, 0.3683, 0.6800, 0.0600, 0.0240…
$ pl_orbeccenerr1 <dbl> 0.0070, 0.0300, NA, 0.0029, 0.0200, 0.0300, 0.0070, NA…
$ pl_orbeccenerr2 <dbl> -0.0070, -0.0300, NA, -0.0029, -0.0200, -0.0200, -0.01…
$ pl_orbeccenlim <dbl> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, NA, NA, NA, NA, NA, …
$ pl_insol <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_insolerr1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_insolerr2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_insollim <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pl_eqt <dbl> NA, NA, NA, NA, NA, NA, NA, 1700, NA, NA, NA, 1450, NA…
$ pl_eqterr1 <dbl> NA, NA, NA, NA, NA, NA, NA, 100, NA, NA, NA, 50, NA, 1…
$ pl_eqterr2 <dbl> NA, NA, NA, NA, NA, NA, NA, -100, NA, NA, NA, -50, NA,…
$ pl_eqtlim <dbl> NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, 0, NA, 0, N…
$ ttv_flag <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ st_spectype <chr> "G8 III", "K4 III", "K0 III", "K0", "G3 V", "K3 III", …
$ st_teff <dbl> 4874, 4213, 4888, 5338, 5750, 4157, 4980, 4060, 4816, …
$ st_tefferr1 <dbl> NA, 46, NA, 25, 8, 11, NA, 300, NA, 44, 44, 100, NA, 5…
$ st_tefferr2 <dbl> NA, -46, NA, -25, -8, -10, NA, -200, NA, -44, -44, -10…
$ st_tefflim <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0,…
$ st_rad <dbl> 13.760000, 29.790000, 11.550000, 0.930000, 1.130000, 2…
$ st_raderr1 <dbl> 2.850000, 2.840000, 1.120000, 0.010000, 0.010000, 0.78…
$ st_raderr2 <dbl> -2.450000, -2.840000, -0.510000, -0.010000, -0.010000,…
$ st_radlim <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0,…
$ st_mass <dbl> 2.090000, 2.780000, 1.780000, 0.970000, 1.080000, 1.22…
$ st_masserr1 <dbl> 0.640000, 0.690000, 0.430000, 0.040000, 0.040000, 0.13…
$ st_masserr2 <dbl> -0.630000, -0.690000, -0.290000, -0.040000, -0.040000,…
$ st_masslim <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ st_met <dbl> -0.260, -0.020, -0.210, 0.430, 0.060, -0.010, -0.060, …
$ st_meterr1 <dbl> 0.10, NA, 0.10, 0.07, NA, 0.10, 0.10, NA, 0.10, 0.04, …
$ st_meterr2 <dbl> -0.10, NA, -0.10, -0.07, NA, -0.10, -0.10, NA, -0.10, …
$ st_metlim <dbl> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, NA, NA, NA, NA, N…
$ st_metratio <chr> "[Fe/H]", "[Fe/H]", "[Fe/H]", "[Fe/H]", "[Fe/H]", "[Fe…
$ st_logg <dbl> 2.45000, 1.93000, 2.55000, 4.45000, 4.36000, 1.70000, …
$ st_loggerr1 <dbl> 0.080000, 0.070000, 0.060000, 0.020000, 0.010000, 0.04…
$ st_loggerr2 <dbl> -0.080000, -0.070000, -0.070000, -0.020000, -0.010000,…
$ st_logglim <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0,…
$ rastr <chr> "12h20m42.91s", "15h17m05.90s", "23h31m17.80s", "16h10…
$ ra <dbl> 185.17878, 229.27460, 352.82415, 242.60210, 295.46564,…
$ decstr <chr> "+17d47m35.71s", "+71d49m26.19s", "+39d14m09.01s", "+4…
$ dec <dbl> 17.7932516, 71.8239428, 39.2358367, 43.8163621, 50.516…
$ sy_dist <dbl> 93.18460, 125.32100, 75.43920, 17.93230, 21.13970, 124…
$ sy_disterr1 <dbl> 1.9238000, 1.9765000, 0.7140000, 0.0073000, 0.0110000,…
$ sy_disterr2 <dbl> -1.9238000, -1.9765000, -0.7140000, -0.0073000, -0.011…
$ sy_vmag <dbl> 4.72307, 5.01300, 5.23133, 6.61935, 6.21500, 5.22606, …
$ sy_vmagerr1 <dbl> 0.023, 0.005, 0.023, 0.023, 0.016, 0.023, 0.023, 0.069…
$ sy_vmagerr2 <dbl> -0.023, -0.005, -0.023, -0.023, -0.016, -0.023, -0.023…
$ sy_kmag <dbl> 2.282000, 1.939000, 2.331000, 4.714000, 4.651000, 2.09…
$ sy_kmagerr1 <dbl> 0.346, 0.270, 0.240, 0.016, 0.016, 0.244, 0.204, 0.021…
$ sy_kmagerr2 <dbl> -0.346, -0.270, -0.240, -0.016, -0.016, -0.244, -0.204…
$ sy_gaiamag <dbl> 4.44038, 4.56216, 4.91781, 6.38300, 6.06428, 4.75429, …
$ sy_gaiamagerr1 <dbl> 0.0038479, 0.0039035, 0.0028262, 0.0003512, 0.0006029,…
$ sy_gaiamagerr2 <dbl> -0.0038479, -0.0039035, -0.0028262, -0.0003512, -0.000…
colnames(exoplanets)
[1] "pl_name" "hostname" "sy_snum" "sy_pnum"
[5] "discoverymethod" "disc_year" "disc_facility" "pl_controv_flag"
[9] "pl_orbper" "pl_orbpererr1" "pl_orbpererr2" "pl_orbperlim"
[13] "pl_orbsmax" "pl_orbsmaxerr1" "pl_orbsmaxerr2" "pl_orbsmaxlim"
[17] "pl_rade" "pl_radeerr1" "pl_radeerr2" "pl_radelim"
[21] "pl_radj" "pl_radjerr1" "pl_radjerr2" "pl_radjlim"
[25] "pl_bmasse" "pl_bmasseerr1" "pl_bmasseerr2" "pl_bmasselim"
[29] "pl_bmassj" "pl_bmassjerr1" "pl_bmassjerr2" "pl_bmassjlim"
[33] "pl_bmassprov" "pl_orbeccen" "pl_orbeccenerr1" "pl_orbeccenerr2"
[37] "pl_orbeccenlim" "pl_insol" "pl_insolerr1" "pl_insolerr2"
[41] "pl_insollim" "pl_eqt" "pl_eqterr1" "pl_eqterr2"
[45] "pl_eqtlim" "ttv_flag" "st_spectype" "st_teff"
[49] "st_tefferr1" "st_tefferr2" "st_tefflim" "st_rad"
[53] "st_raderr1" "st_raderr2" "st_radlim" "st_mass"
[57] "st_masserr1" "st_masserr2" "st_masslim" "st_met"
[61] "st_meterr1" "st_meterr2" "st_metlim" "st_metratio"
[65] "st_logg" "st_loggerr1" "st_loggerr2" "st_logglim"
[69] "rastr" "ra" "decstr" "dec"
[73] "sy_dist" "sy_disterr1" "sy_disterr2" "sy_vmag"
[77] "sy_vmagerr1" "sy_vmagerr2" "sy_kmag" "sy_kmagerr1"
[81] "sy_kmagerr2" "sy_gaiamag" "sy_gaiamagerr1" "sy_gaiamagerr2"
summary(exoplanets)
pl_name hostname sy_snum sy_pnum
Length:6286 Length:6286 Min. :1.000 Min. :1.000
Class :character Class :character 1st Qu.:1.000 1st Qu.:1.000
Mode :character Mode :character Median :1.000 Median :1.000
Mean :1.101 Mean :1.762
3rd Qu.:1.000 3rd Qu.:2.000
Max. :4.000 Max. :8.000
discoverymethod disc_year disc_facility pl_controv_flag
Length:6286 Min. :1992 Length:6286 Min. :0.000000
Class :character 1st Qu.:2014 Class :character 1st Qu.:0.000000
Mode :character Median :2016 Mode :character Median :0.000000
Mean :2017 Mean :0.008591
3rd Qu.:2021 3rd Qu.:0.000000
Max. :2026 Max. :1.000000
NA's :1
pl_orbper pl_orbpererr1 pl_orbpererr2
Min. : 0 Min. : 0 Min. :-100000000
1st Qu.: 4 1st Qu.: 0 1st Qu.: 0
Median : 11 Median : 0 Median : 0
Mean : 71986 Mean : 87738 Mean : -20041
3rd Qu.: 38 3rd Qu.: 0 3rd Qu.: 0
Max. :402000000 Max. :470000000 Max. : 0
NA's :340 NA's :830 NA's :830
pl_orbperlim pl_orbsmax pl_orbsmaxerr1
Min. :-1.00000 Min. :4.400e-03 Min. : 0.0000
1st Qu.: 0.00000 1st Qu.:5.230e-02 1st Qu.: 0.0007
Median : 0.00000 Median :1.021e-01 Median : 0.0020
Mean :-0.00101 Mean :1.566e+01 Mean : 1.8013
3rd Qu.: 0.00000 3rd Qu.:3.047e-01 3rd Qu.: 0.0155
Max. : 0.00000 Max. :1.900e+04 Max. :5205.0000
NA's :340 NA's :426 NA's :2364
pl_orbsmaxerr2 pl_orbsmaxlim pl_rade pl_radeerr1
Min. :-2.060e+03 Min. :-1.00000 Min. : 0.3098 Min. : 0.0000
1st Qu.:-1.600e-02 1st Qu.: 0.00000 1st Qu.: 1.8400 1st Qu.: 0.1273
Median :-2.000e-03 Median : 0.00000 Median : 2.8466 Median : 0.2700
Mean :-9.593e-01 Mean :-0.00051 Mean : 5.8078 Mean : 0.5144
3rd Qu.:-7.300e-04 3rd Qu.: 0.00000 3rd Qu.:11.9000 3rd Qu.: 0.5500
Max. : 0.000e+00 Max. : 0.00000 Max. :87.2059 Max. :68.9100
NA's :2364 NA's :425 NA's :50 NA's :1935
pl_radeerr2 pl_radelim pl_radj pl_radjerr1
Min. :-32.5061 Min. :-1.000000 Min. :0.02764 Min. :0.00000
1st Qu.: -0.4500 1st Qu.: 0.000000 1st Qu.:0.16415 1st Qu.:0.01128
Median : -0.2200 Median : 0.000000 Median :0.25383 Median :0.02409
Mean : -0.4135 Mean :-0.000641 Mean :0.51816 Mean :0.04589
3rd Qu.: -0.1100 3rd Qu.: 0.000000 3rd Qu.:1.06000 3rd Qu.:0.04907
Max. : 0.0000 Max. : 0.000000 Max. :7.78000 Max. :6.14774
NA's :1935 NA's :50 NA's :50 NA's :1935
pl_radjerr2 pl_radjlim pl_bmasse pl_bmasseerr1
Min. :-2.90000 Min. :-1.000000 Min. : 0.020 Min. : 0.00
1st Qu.:-0.04000 1st Qu.: 0.000000 1st Qu.: 4.275 1st Qu.: 2.20
Median :-0.01963 Median : 0.000000 Median : 9.200 Median : 18.12
Mean :-0.03690 Mean :-0.000641 Mean : 400.705 Mean : 173.37
3rd Qu.:-0.00981 3rd Qu.: 0.000000 3rd Qu.: 182.911 3rd Qu.: 77.00
Max. : 0.00000 Max. : 0.000000 Max. :9534.852 Max. :12652.75
NA's :1935 NA's :50 NA's :31 NA's :3273
pl_bmasseerr2 pl_bmasselim pl_bmassj pl_bmassjerr1
Min. :-6038.74 Min. :-1.00000 Min. :6.290e-05 Min. : 0.0000
1st Qu.: -69.92 1st Qu.: 0.00000 1st Qu.:1.343e-02 1st Qu.: 0.0069
Median : -16.84 Median : 0.00000 Median :2.895e-02 Median : 0.0580
Mean : -126.42 Mean : 0.03229 Mean :1.261e+00 Mean : 0.5459
3rd Qu.: -2.04 3rd Qu.: 0.00000 3rd Qu.:5.755e-01 3rd Qu.: 0.2410
Max. : 0.00 Max. : 1.00000 Max. :3.000e+01 Max. :39.8100
NA's :3273 NA's :31 NA's :31 NA's :3273
pl_bmassjerr2 pl_bmassjlim pl_bmassprov pl_orbeccen
Min. :-19.0000 Min. :-1.00000 Length:6286 Min. :0.00000
1st Qu.: -0.2200 1st Qu.: 0.00000 Class :character 1st Qu.:0.00000
Median : -0.0530 Median : 0.00000 Mode :character Median :0.00000
Mean : -0.3979 Mean : 0.03229 Mean :0.07927
3rd Qu.: -0.0064 3rd Qu.: 0.00000 3rd Qu.:0.09000
Max. : 0.0000 Max. : 1.00000 Max. :0.95000
NA's :3273 NA's :31 NA's :1049
pl_orbeccenerr1 pl_orbeccenerr2 pl_orbeccenlim pl_insol
Min. :0.0000 Min. :-0.7030 Min. :-1.00000 Min. : 0.00
1st Qu.:0.0200 1st Qu.:-0.0788 1st Qu.: 0.00000 1st Qu.: 24.10
Median :0.0480 Median :-0.0400 Median : 0.00000 Median : 99.99
Mean :0.0676 Mean :-0.0548 Mean : 0.05232 Mean : 419.35
3rd Qu.:0.0900 3rd Qu.:-0.0170 3rd Qu.: 0.00000 3rd Qu.: 376.07
Max. :0.5000 Max. : 0.0000 Max. : 1.00000 Max. :44900.00
NA's :4408 NA's :4408 NA's :1049 NA's :1880
pl_insolerr1 pl_insolerr2 pl_insollim pl_eqt
Min. : 0.000 Min. :-7200.000 Min. :0 Min. : 34.0
1st Qu.: 1.341 1st Qu.: -22.008 1st Qu.:0 1st Qu.: 569.0
Median : 5.905 Median : -5.225 Median :0 Median : 823.0
Mean : 41.333 Mean : -32.019 Mean :0 Mean : 914.5
3rd Qu.: 26.389 3rd Qu.: -1.253 3rd Qu.:0 3rd Qu.:1163.0
Max. :8100.000 Max. : 0.000 Max. :0 Max. :4050.0
NA's :2644 NA's :2644 NA's :1880 NA's :1601
pl_eqterr1 pl_eqterr2 pl_eqtlim ttv_flag
Min. : 0.00 Min. :-1217.00 Min. :0.00000 Min. :0.00000
1st Qu.: 11.25 1st Qu.: -34.00 1st Qu.:0.00000 1st Qu.:0.00000
Median : 20.00 Median : -19.21 Median :0.00000 Median :0.00000
Mean : 31.39 Mean : -30.84 Mean :0.00064 Mean :0.07811
3rd Qu.: 35.00 3rd Qu.: -11.00 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1217.00 Max. : -0.32 Max. :1.00000 Max. :1.00000
NA's :4459 NA's :4459 NA's :1601
st_spectype st_teff st_tefferr1 st_tefferr2
Length:6286 Min. : 415 Min. : 1.00 Min. :-2360.24
Class :character 1st Qu.: 4896 1st Qu.: 59.83 1st Qu.: -124.62
Mode :character Median : 5542 Median : 86.00 Median : -86.62
Mean : 5393 Mean : 98.50 Mean : -99.22
3rd Qu.: 5897 3rd Qu.: 120.36 3rd Qu.: -58.55
Max. :57000 Max. :1763.10 Max. : -1.00
NA's :294 NA's :607 NA's :607
st_tefflim st_rad st_raderr1 st_raderr2
Min. :0.000000 Min. : 0.0115 Min. : 0.0000 Min. :-28.97000
1st Qu.:0.000000 1st Qu.: 0.7700 1st Qu.: 0.0230 1st Qu.: -0.10500
Median :0.000000 Median : 0.9500 Median : 0.0500 Median : -0.05000
Mean :0.000167 Mean : 1.4911 Mean : 0.1876 Mean : -0.14659
3rd Qu.:0.000000 3rd Qu.: 1.2400 3rd Qu.: 0.1640 3rd Qu.: -0.02279
Max. :1.000000 Max. :88.4750 Max. :104.5280 Max. : 0.00000
NA's :294 NA's :318 NA's :590 NA's :591
st_radlim st_mass st_masserr1 st_masserr2
Min. :0 Min. : 0.0094 Min. : 0.00020 Min. :-10.02000
1st Qu.:0 1st Qu.: 0.7700 1st Qu.: 0.02900 1st Qu.: -0.08000
Median :0 Median : 0.9400 Median : 0.04800 Median : -0.04900
Mean :0 Mean : 0.9355 Mean : 0.08517 Mean : -0.07891
3rd Qu.:0 3rd Qu.: 1.0830 3rd Qu.: 0.08900 3rd Qu.: -0.02900
Max. :0 Max. :10.9400 Max. :10.02000 Max. : 0.00000
NA's :318 NA's :9 NA's :285 NA's :286
st_masslim st_met st_meterr1 st_meterr2
Min. :0 Min. :-1.00000 Min. :0.00100 Min. :-1.0000
1st Qu.:0 1st Qu.:-0.08000 1st Qu.:0.04000 1st Qu.:-0.1500
Median :0 Median : 0.02000 Median :0.08120 Median :-0.0817
Mean :0 Mean : 0.01658 Mean :0.09647 Mean :-0.1022
3rd Qu.:0 3rd Qu.: 0.13000 3rd Qu.:0.14000 3rd Qu.:-0.0400
Max. :0 Max. : 0.79000 Max. :0.96000 Max. : 0.0000
NA's :9 NA's :640 NA's :924 NA's :924
st_metlim st_metratio st_logg st_loggerr1
Min. :-1 Length:6286 Min. :0.541 Min. :0.00000
1st Qu.: 0 Class :character 1st Qu.:4.294 1st Qu.:0.02900
Median : 0 Mode :character Median :4.458 Median :0.05500
Mean : 0 Mean :4.386 Mean :0.08246
3rd Qu.: 0 3rd Qu.:4.580 3rd Qu.:0.10300
Max. : 1 Max. :8.070 Max. :1.10000
NA's :640 NA's :322 NA's :613
st_loggerr2 st_logglim rastr ra
Min. :-3.51000 Min. :-1.00000 Length:6286 Min. : 0.1856
1st Qu.:-0.15000 1st Qu.: 0.00000 Class :character 1st Qu.:169.4479
Median :-0.07200 Median : 0.00000 Mode :character Median :284.4335
Mean :-0.09688 Mean :-0.00101 Mean :231.2644
3rd Qu.:-0.03000 3rd Qu.: 0.00000 3rd Qu.:293.1234
Max. : 0.00000 Max. : 0.00000 Max. :359.9750
NA's :613 NA's :322
decstr dec sy_dist sy_disterr1
Length:6286 Min. :-89.47 Min. : 1.301 Min. :3.400e-04
Class :character 1st Qu.:-12.05 1st Qu.: 101.730 1st Qu.:4.220e-01
Mode :character Median : 39.04 Median : 362.715 Median :3.781e+00
Mean : 17.71 Mean : 704.443 Mean :5.916e+01
3rd Qu.: 45.41 3rd Qu.: 826.074 3rd Qu.:1.668e+01
Max. : 88.83 Max. :8500.000 Max. :2.600e+03
NA's :27 NA's :135
sy_disterr2 sy_vmag sy_vmagerr1 sy_vmagerr2
Min. :-2.840e+03 Min. : 0.872 Min. :0.00100 Min. :-11.92000
1st Qu.:-1.621e+01 1st Qu.:10.700 1st Qu.:0.03000 1st Qu.: -0.12600
Median :-3.725e+00 Median :13.193 Median :0.06900 Median : -0.06900
Mean :-6.636e+01 Mean :12.541 Mean :0.09716 Mean : -0.09863
3rd Qu.:-4.200e-01 3rd Qu.:14.915 3rd Qu.:0.12600 3rd Qu.: -0.03000
Max. :-3.500e-04 Max. :44.610 Max. :3.10000 Max. : -0.00100
NA's :135 NA's :299 NA's :307 NA's :312
sy_kmag sy_kmagerr1 sy_kmagerr2 sy_gaiamag
Min. :-3.044 Min. :0.01100 Min. :-9.99500 Min. : 2.364
1st Qu.: 8.406 1st Qu.:0.02000 1st Qu.:-0.03000 1st Qu.:10.424
Median :11.056 Median :0.02300 Median :-0.02300 Median :12.920
Mean :10.372 Mean :0.04096 Mean :-0.04098 Mean :12.248
3rd Qu.:12.706 3rd Qu.:0.03000 3rd Qu.:-0.02000 3rd Qu.:14.660
Max. :33.110 Max. :9.99500 Max. :-0.01100 Max. :20.186
NA's :286 NA's :323 NA's :334 NA's :343
sy_gaiamagerr1 sy_gaiamagerr2
Min. :0.00011 Min. :-0.06323
1st Qu.:0.00026 1st Qu.:-0.00054
Median :0.00036 Median :-0.00036
Mean :0.00064 Mean :-0.00064
3rd Qu.:0.00054 3rd Qu.:-0.00026
Max. :0.06323 Max. :-0.00011
NA's :343 NA's :343
Exoplanet equilibrium temperature (pl_eqt) was used as the response variable for this study. Predictor variables have been chosen according to their anticipated physical relationship with planetary temperature.
Selected predictor variables include:
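- Stellar temperature (st_teff)
- Stellar mass (st_mass)
- Stellar radius (st_rad)
- Orbital distance, i.e. semi-major axis (pl_orbsmax)
- Orbital period (pl_orbper)
- Planetary radius (pl_rade)
- Planetary mass (pl_bmasse)
- Orbital eccentricity (pl_orbeccen)

exo_clean <- exoplanets %>%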
select(
pl_eqt,
st_teff,
st_mass,
st_rad,
pl_orbsmax,
pl_orbper,
pl_rade,
pl_bmasse,
pl_orbeccen
) %>%
drop_na()
dim(exo_clean)
[1] 4106 9
summary(exo_clean)
pl_eqt st_teff st_mass st_rad
Min. : 55.9 Min. : 2566 Min. :0.0898 Min. :0.0131
1st Qu.: 568.0 1st Qu.: 5080 1st Qu.:0.8110 1st Qu.:0.7820
Median : 822.0 Median : 5618 Median :0.9500 Median :0.9495
Mean : 911.9 Mean : 5426 Mean :0.9473 Mean :1.0231
3rd Qu.:1166.0 3rd Qu.: 5946 3rd Qu.:1.0810 3rd Qu.:1.2007
Max. :4050.0 Max. :10170 Max. :2.7800 Max. :6.3000
pl_orbsmax pl_orbper pl_rade pl_bmasse
Min. : 0.005626 Min. :1.800e-01 Min. : 0.3098 Min. : 0.0374
1st Qu.: 0.048000 1st Qu.:3.947e+00 1st Qu.: 1.6400 1st Qu.: 3.5000
Median : 0.079500 Median :8.759e+00 Median : 2.4900 Median : 7.2700
Mean : 0.205912 Mean :1.273e+02 Mean : 4.5440 Mean : 147.4653
3rd Qu.: 0.147295 3rd Qu.:2.122e+01 3rd Qu.: 4.3407 3rd Qu.: 26.6994
Max. :63.000000 Max. :1.170e+05 Max. :25.0000 Max. :8899.1954
pl_orbeccen
Min. :0.00000
1st Qu.:0.00000
Median :0.00000
Mean :0.04151
3rd Qu.:0.01300
Max. :0.93183
colSums(is.na(exo_clean))
Missing values were checked again after preprocessing. None remain in the selected variables, and the cleaned dataset retains 4,106 of the original 6,286 planets.
ggplot(exo_clean, aes(x = pl_eqt)) +
geom_histogram(bins = 30) +
labs(
title = "Distribution of Exoplanet Equilibrium Temperature",
x = "Equilibrium Temperature (K)",
y = "Count"
) +
theme(
plot.title = element_text(hjust = 0.5)
)
The distribution of equilibrium temperature is right-skewed. Most planets fall between roughly 400 K and 1,200 K (the interquartile range is 568 K to 1,166 K), while a small number of very hot planets, up to 4,050 K, form a long right tail. These extreme values are potential outliers and likely correspond to planets that receive far more stellar radiation than the rest of the sample.
cor_matrix <- cor(exo_clean)
corrplot(
cor_matrix,
method = "color",
tl.cex = 0.7
)
The correlation matrix shows several notable relationships. Equilibrium temperature (pl_eqt) is positively correlated with stellar temperature, stellar mass, stellar radius, and planetary radius, so larger planets and planets orbiting hotter, more massive, and larger stars tend to have higher equilibrium temperatures.
In addition, there exist very high positive correlations between several stellar features, especially stellar temperature, stellar mass, and stellar radius. These relationships imply multicollinearity between predictor variables that might influence the stability of the coefficients in the multiple linear regression models. Thus, regularization methods such as ridge and lasso regression can be used later during model training.
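One common way to quantify the multicollinearity described above is the variance inflation factor, \(VIF_j = 1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the remaining predictors. The sketch below uses simulated, deliberately collinear columns rather than the archive data; only the column names are borrowed from this report, and this diagnostic was not part of the original analysis.

```r
set.seed(1)
n <- 500
# Simulated, deliberately collinear stellar-like predictors (not archive data)
st_mass <- rnorm(n, mean = 1, sd = 0.2)
st_rad  <- st_mass + rnorm(n, sd = 0.05)  # nearly a copy of st_mass
pl_rade <- rnorm(n, mean = 3, sd = 1)     # unrelated predictor
X <- data.frame(st_mass, st_rad, pl_rade)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the others
vif <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 1)
```

Values well above the usual rule-of-thumb range of 5 to 10 flag predictors whose coefficients are likely to be unstable in an ordinary least squares fit.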
Finally, orbital distance (pl_orbsmax) and orbital period (pl_orbper) are very strongly positively correlated, as expected from Kepler's third law, which ties the orbital period to the semi-major axis and the stellar mass.
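The link between orbital distance and orbital period can be checked directly with Kepler's third law, \(P^2 = a^3 / M_\star\), with \(P\) in years, \(a\) in AU, and \(M_\star\) in solar masses. The helper name `kepler_period_days()` is hypothetical, and the sketch neglects the planet's mass.

```r
# Orbital period in days from semi-major axis (AU) and stellar mass (solar masses).
# Hypothetical helper; the planet's own mass is neglected.
kepler_period_days <- function(pl_orbsmax, st_mass) {
  365.25 * sqrt(pl_orbsmax^3 / st_mass)
}

kepler_period_days(1, 1)     # one year, in days
kepler_period_days(0.05, 1)  # a close-in orbit, about 4.1 days
```

Because the period is almost fully determined by the semi-major axis and the stellar mass, pl_orbper adds little independent information once pl_orbsmax and st_mass are in the model.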
ggplot(exo_clean, aes(x = st_teff, y = pl_eqt)) +
geom_point(alpha = 0.5) +
labs(
title = "Equilibrium Temperature vs Stellar Temperature",
x = "Stellar Temperature",
y = "Equilibrium Temperature"
) +
theme(
plot.title = element_text(hjust = 0.5)
)
The scatterplot indicates a positive relationship between stellar temperature and exoplanet equilibrium temperature. Planets orbiting hotter stars generally exhibit higher equilibrium temperatures, which is consistent with the physical expectation that hotter stars emit greater amounts of stellar radiation.
The relationship also displays increasing variability at higher stellar temperatures, suggesting that additional planetary and orbital factors influence equilibrium temperature. Several extreme observations are also visible, indicating the presence of potentially influential outliers within the dataset.
ggplot(exo_clean, aes(x = pl_orbsmax, y = pl_eqt)) +
geom_point(alpha = 0.5) +
scale_x_log10() +
labs(
title = "Equilibrium Temperature vs Orbital Distance",
x = "Semi-Major Axis (log scale)",
y = "Equilibrium Temperature"
) +
theme(
plot.title = element_text(hjust = 0.5)
)
The log-scale scatterplot shows a clear negative relationship between orbital distance and equilibrium temperature. Planets that orbit close to their stars receive far more stellar energy and therefore have much higher equilibrium temperatures; as orbital distance increases, equilibrium temperature falls.
A logarithmic scale was used for the semi-major axis because orbital distances span several orders of magnitude. The curved pattern in the plot indicates that orbital distance is an important predictor, but also that its effect on equilibrium temperature is not linear on the original scale.
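The curvature noted above is what an inverse-square-root dependence on distance would produce. The sketch below uses simulated planets around a single Sun-like star (not the archive data, and not a model fitted in this report) to illustrate how a log-log regression recovers the exponent where a straight-line fit on the raw scale does poorly. The constant 278 and the noise level are arbitrary choices.

```r
set.seed(2)
n <- 300
# Simulated planets around one Sun-like star (not archive data)
a    <- 10^runif(n, -2, 1)                         # semi-major axis in AU
t_eq <- 278 * a^(-0.5) * exp(rnorm(n, sd = 0.05))  # power law with noise

fit_raw <- lm(t_eq ~ a)            # linear in a: misses the curvature
fit_log <- lm(log(t_eq) ~ log(a))  # linear in log(a): matches the power law

coef(fit_log)[["log(a)"]]          # close to -0.5
c(raw = summary(fit_raw)$r.squared, log = summary(fit_log)$r.squared)
```

This suggests that log-transforming pl_orbsmax (and possibly pl_eqt) before fitting a linear model could be worth trying.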
set.seed(621)
train_index <- createDataPartition(
exo_clean$pl_eqt,
p = 0.8,
list = FALSE
)
train_data <- exo_clean[train_index, ]
test_data <- exo_clean[-train_index, ]
The data were split 80/20 into a training set and a testing set. Models were fitted on the training set only, and the testing set was held out for evaluating predictive performance.
lm_model <- lm(
pl_eqt ~ st_teff +
st_mass +
st_rad +
pl_orbsmax +
pl_orbper +
pl_rade +
pl_bmasse +
pl_orbeccen,
data = train_data
)
broom::tidy(lm_model) %>%
knitr::kable(
digits = 3,
caption = "Multiple Linear Regression Coefficients"
)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 45.591 | 54.670 | 0.834 | 0.404 |
| st_teff | 0.053 | 0.017 | 3.082 | 0.002 |
| st_mass | 426.962 | 64.943 | 6.574 | 0.000 |
| st_rad | 90.816 | 24.327 | 3.733 | 0.000 |
| pl_orbsmax | -254.960 | 15.275 | -16.691 | 0.000 |
| pl_orbper | 0.138 | 0.009 | 15.435 | 0.000 |
| pl_rade | 27.796 | 1.622 | 17.133 | 0.000 |
| pl_bmasse | 0.044 | 0.012 | 3.652 | 0.000 |
| pl_orbeccen | -446.508 | 59.683 | -7.481 | 0.000 |
The multiple linear regression identified several statistically significant predictors of equilibrium temperature. Stellar temperature, stellar mass, and stellar radius all have positive coefficients, meaning that, holding the other predictors fixed, planets around hotter, more massive, and larger stars have higher predicted equilibrium temperatures.
Orbital distance (pl_orbsmax) has a strong negative coefficient: more distant planets receive less radiation from their host stars and are correspondingly cooler. Orbital eccentricity also has a negative coefficient. Orbital period (pl_orbper), by contrast, has a small positive coefficient; because period and orbital distance are highly correlated, this sign is more plausibly a symptom of multicollinearity than a physical effect.
Planetary radius and planetary mass both have positive coefficients, so larger planets in this sample tend to have higher equilibrium temperatures.
All eight predictors were statistically significant at the 0.05 level (p ≤ 0.002); only the intercept was not (p = 0.404).
lm_preds <- predict(
lm_model,
newdata = test_data
)
rmse_lm <- RMSE(lm_preds, test_data$pl_eqt)
mae_lm <- MAE(lm_preds, test_data$pl_eqt)
r2_lm <- R2(lm_preds, test_data$pl_eqt)
metrics_lm <- data.frame(
Model = "Linear Regression",
RMSE = rmse_lm,
MAE = mae_lm,
`$R^2$` = r2_lm,
check.names = FALSE
)
knitr::kable(
metrics_lm,
digits = 3,
caption = "Linear Regression Performance Metrics"
)

| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Linear Regression | 350.006 | 259.251 | 0.427 |
On the test set, the multiple linear regression model shows moderate predictive accuracy, with an \(R^2\) of approximately 0.427; that is, the model accounts for roughly 43% of the variation in equilibrium temperature.
The RMSE of about 350 K and the MAE of about 259 K are large relative to the median equilibrium temperature of 822 K, so individual predictions remain imprecise. Although the linear model captures several of the main dependencies, equilibrium temperature evidently depends on factors, or on nonlinear forms of these predictors, that the model does not represent.
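For reference, the three metrics can be written out in base R. One detail to keep in mind when reading the \(R^2\) values in this report: as far as I am aware, caret's `R2()` defaults to the squared correlation between predictions and observations (`formula = "corr"`) rather than the `1 - SSE/SST` definition (`formula = "traditional"`), and the two can differ when predictions are biased. The helper names and the numbers below are made up for illustration.

```r
rmse    <- function(pred, obs) sqrt(mean((obs - pred)^2))
mae     <- function(pred, obs) mean(abs(obs - pred))
r2_corr <- function(pred, obs) cor(pred, obs)^2  # squared correlation
r2_trad <- function(pred, obs) {                 # 1 - SSE / SST
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

obs  <- c(500, 800, 1200, 1500, 2000)  # made-up temperatures in K
pred <- obs * 0.5 + 100                # perfectly correlated but biased

c(rmse = rmse(pred, obs), mae = mae(pred, obs))
r2_corr(pred, obs)  # 1: correlation ignores the bias
r2_trad(pred, obs)  # negative here: the bias is penalised
```

Reporting both versions, or at least stating which one is used, makes the "variance explained" interpretation easier to defend.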
plot_df <- data.frame(
Actual = test_data$pl_eqt,
Predicted = lm_preds
)
ggplot(plot_df, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.5) +
geom_abline(
slope = 1,
intercept = 0,
color = "red"
) +
labs(
title = "Predicted vs Actual Equilibrium Temperature",
x = "Actual Equilibrium Temperature",
y = "Predicted Equilibrium Temperature"
) +
theme(
plot.title = element_text(hjust = 0.5)
)
The plot of predicted against actual values shows that the multiple linear regression model captures the overall trend in equilibrium temperature. Many observations lie near the reference line, so the model predicts reasonably well for planets with typical temperatures.
The scatter around the reference line widens as equilibrium temperature rises, showing that the model struggles with the hottest planets. A few observations lie far from the reference line and may be outliers.
residual_df <- data.frame(
Predicted = lm_preds,
Residuals = test_data$pl_eqt - lm_preds
)
ggplot(residual_df, aes(x = Predicted, y = Residuals)) +
geom_point(alpha = 0.5) +
geom_hline(
yintercept = 0,
color = "red"
) +
labs(
title = "Residual Plot",
x = "Predicted Values",
y = "Residuals"
) +
theme(
plot.title = element_text(hjust = 0.5)
)
The residuals are scattered around zero with no strong systematic trend, which suggests that the linear model captures the main relationships between the variables.
However, the spread of the residuals increases at higher predicted equilibrium temperatures, which points to possible heteroscedasticity and to nonlinear relationships that the model misses. Several large residuals also indicate planetary systems that the linear regression does not describe well.
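The visual impression of heteroscedasticity can be backed by a simple Breusch-Pagan-style check: regress the squared residuals on the fitted values and compare \(n R^2\) to a chi-square distribution with one degree of freedom. The sketch below uses simulated data whose noise grows with the predictor; it is not a test that was run in this report.

```r
set.seed(3)
n <- 400
x <- runif(n, 1, 10)
y <- 100 * x + rnorm(n, sd = 20 * x)  # noise grows with x (simulated)

fit <- lm(y ~ x)
e2  <- resid(fit)^2                   # squared residuals
f   <- fitted(fit)
aux <- lm(e2 ~ f)                     # auxiliary regression on fitted values

lm_stat <- n * summary(aux)$r.squared # approx. chi-square(1) under constant variance
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
p_value
```

A small p-value, as in this simulation, would support using a transformed response or robust standard errors.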
# glmnet needs numeric matrices; [, -1] drops the intercept column
x_train <- model.matrix(
  pl_eqt ~ .,
  train_data
)[, -1]
y_train <- train_data$pl_eqt
x_test <- model.matrix(
  pl_eqt ~ .,
  test_data
)[, -1]
y_test <- test_data$pl_eqt
# alpha = 0 selects the ridge penalty; lambda is chosen by cross-validation
ridge_model <- cv.glmnet(
  x_train,
  y_train,
  alpha = 0
)
ridge_preds <- predict(
  ridge_model,
  s = ridge_model$lambda.min,
  newx = x_test
)
ridge_preds <- as.numeric(ridge_preds)
rmse_ridge <- RMSE(ridge_preds, y_test)
mae_ridge <- MAE(ridge_preds, y_test)
r2_ridge <- R2(ridge_preds, y_test)
ridge_metrics <- data.frame(
  Model = "Ridge Regression",
  RMSE = rmse_ridge,
  MAE = mae_ridge,
  `$R^2$` = r2_ridge,
  check.names = FALSE
)
knitr::kable(
  ridge_metrics,
  digits = 3,
  caption = "Ridge Regression Performance Metrics"
)
| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Ridge Regression | 360.803 | 268.428 | 0.388 |
Ridge regression performed similarly to the multiple linear regression model but slightly worse on all three metrics (RMSE 360.8 versus 350.0, MAE 268.4 versus 259.3, and \(R^2\) 0.388 versus 0.427). Although several predictors are correlated with one another, shrinking the coefficients did not improve test-set performance.
Despite the slightly lower accuracy, ridge regression can still be useful for stabilizing coefficient estimates when predictors are collinear.
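To illustrate what shrinking the coefficients means, the sketch below computes the closed-form ridge estimate in base R on simulated, deliberately collinear predictors. The helper `ridge_coef` and the data are hypothetical, and `glmnet` uses a different algorithm and a different scaling of the penalty, so the values of lambda here are not comparable to `lambda.min` above.

```r
# Closed-form ridge estimate on standardized predictors (illustration only).
ridge_coef <- function(x, y, lambda) {
  x_std <- scale(x)        # centre and scale each predictor
  y_ctr <- y - mean(y)     # centring y removes the need for an intercept
  p <- ncol(x_std)
  solve(crossprod(x_std) + lambda * diag(p), crossprod(x_std, y_ctr))
}

# Simulated predictors (hypothetical data): x2 is nearly a copy of x1.
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
x <- cbind(x1, x2, x3)
y <- 1 + 2 * x1 + 0.5 * x3 + rnorm(n)

# Euclidean norm of the coefficient vector for increasing penalties.
lambdas <- c(0, 1, 10, 100)
coef_norms <- sapply(lambdas, function(l) sqrt(sum(ridge_coef(x, y, l)^2)))
print(data.frame(lambda = lambdas, coef_norm = round(coef_norms, 3)))
```

The coefficient norm falls as the penalty grows. With `lambda = 0` the estimate equals ordinary least squares on the standardized predictors, which is where collinear predictors produce the least stable coefficients.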
# Assumed fit (not shown in this section): alpha = 1 selects the lasso penalty,
# mirroring the ridge call above
lasso_model <- cv.glmnet(
  x_train,
  y_train,
  alpha = 1
)
lasso_preds <- predict(
  lasso_model,
  s = lasso_model$lambda.min,
  newx = x_test
)
lasso_preds <- as.numeric(lasso_preds)
rmse_lasso <- RMSE(lasso_preds, y_test)
mae_lasso <- MAE(lasso_preds, y_test)
r2_lasso <- R2(lasso_preds, y_test)
lasso_metrics <- data.frame(
  Model = "Lasso Regression",
  RMSE = rmse_lasso,
  MAE = mae_lasso,
  `$R^2$` = r2_lasso,
  check.names = FALSE
)
knitr::kable(
  lasso_metrics,
  digits = 3,
  caption = "Lasso Regression Performance Metrics"
)
| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Lasso Regression | 365.669 | 272.591 | 0.372 |
Lasso regression performed slightly worse than both the baseline linear regression and ridge regression. This suggests that shrinking coefficients toward zero removed some information that is useful for prediction.
Even so, the lasso coefficients give a rough indication of which predictors the model relies on. Stellar mass, stellar radius, orbital distance, and planetary radius all retain comparatively large coefficients.
lasso_coef <- as.matrix(
  coef(lasso_model, s = "lambda.min")
)
lasso_coef_df <- data.frame(
  Variable = rownames(lasso_coef),
  Coefficient = lasso_coef[, 1],
  row.names = NULL
)
knitr::kable(
  lasso_coef_df,
  digits = 3,
  caption = "Lasso Regression Coefficients"
)
| Variable | Coefficient |
|---|---|
| (Intercept) | 116.587 |
| st_teff | 0.030 |
| st_mass | 485.170 |
| st_rad | 73.219 |
| pl_orbsmax | -61.009 |
| pl_orbper | 0.025 |
| pl_rade | 26.849 |
| pl_bmasse | 0.005 |
| pl_orbeccen | -399.828 |
The lasso penalty shrinks coefficients to reduce model complexity and can set some of them exactly to zero, which amounts to automatic feature selection. At the selected penalty (lambda.min), stellar mass, stellar radius, orbital distance, planetary radius, and orbital eccentricity keep sizable coefficients, so these variables still contribute to predicting equilibrium temperature.
The coefficients of orbital period (pl_orbper) and planetary mass (pl_bmasse) are close to zero, although neither is exactly zero. This suggests that these predictors add little once the other predictors are in the model. Because the coefficients are reported in the original units of each predictor, their magnitudes are not directly comparable across variables.
In this way, lasso regression helps identify which predictors matter most in the regression analysis.
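Raw coefficient size depends on the units of each predictor, so a near-zero coefficient does not by itself mean a weak predictor. The sketch below, on simulated data with hypothetical variable names, rescales each coefficient by the standard deviation of its predictor to express the change in the response per one standard deviation. It uses ordinary least squares for simplicity, but the same rescaling applies to lasso coefficients.

```r
# Simulated data (hypothetical): `period` spans a wide range with a small
# coefficient, `ratio` spans a narrow range with a large coefficient.
set.seed(1)
n <- 1000
period <- rnorm(n, mean = 0, sd = 1000)
ratio <- rnorm(n, mean = 0, sd = 0.1)
y <- 0.02 * period + 50 * ratio + rnorm(n)

fit <- lm(y ~ period + ratio)

raw <- coef(fit)[-1]                              # drop the intercept
sds <- c(period = sd(period), ratio = sd(ratio))
per_sd <- abs(raw) * sds[names(raw)]              # change in y per 1 SD

print(round(cbind(raw = raw, per_sd = per_sd), 3))
```

In this simulation the predictor with the smaller raw coefficient has the larger effect per standard deviation, which is why the ranking of raw coefficients should be read with care.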
set.seed(621)
rf_model <- randomForest(
  pl_eqt ~ .,
  data = train_data,
  importance = TRUE
)
rf_preds <- predict(
  rf_model,
  newdata = test_data
)
rmse_rf <- RMSE(rf_preds, y_test)
mae_rf <- MAE(rf_preds, y_test)
r2_rf <- R2(rf_preds, y_test)
rf_metrics <- data.frame(
  Model = "Random Forest",
  RMSE = rmse_rf,
  MAE = mae_rf,
  R2 = r2_rf
)
colnames(rf_metrics)[4] <- "$R^2$"
knitr::kable(
  rf_metrics,
  digits = 3,
  caption = "Random Forest Performance Metrics"
)
| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Random Forest | 109.787 | 62.081 | 0.945 |
Compared with the previous regression models, the random forest model performed better on every evaluation metric. Its \(R^2\) was about 0.945, so the predictions account for about 94% of the variance in equilibrium temperature on the test set.
The large gap between the linear models and the random forest suggests that nonlinear dependencies and interactions among stellar, orbital, and planetary properties strongly affect equilibrium temperature. Linear regression, by contrast, cannot capture such relationships unless they are specified explicitly.
This result illustrates how effective machine learning models can be for this kind of problem.
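How the variance-explained statement should be read depends on how \(R^2\) is defined. The metrics here come from `caret`, whose `R2()` function defaults to the squared correlation between predictions and observations (`formula = "corr"`) rather than the traditional 1 minus the ratio of residual to total sum of squares. The sketch below defines both versions in base R and applies them to made-up values. The helper names and the numbers are illustrative assumptions.

```r
# Base R versions of the evaluation metrics (helper names are hypothetical).
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
mae <- function(pred, obs) mean(abs(pred - obs))
r2_corr <- function(pred, obs) cor(pred, obs)^2
r2_trad <- function(pred, obs) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Made-up values: predictions follow the observations exactly in pattern
# but are biased upward by a constant 100.
obs <- c(300, 600, 900, 1200, 1500)
pred <- obs + 100

metrics <- c(
  RMSE = rmse(pred, obs),
  MAE = mae(pred, obs),
  R2_corr = r2_corr(pred, obs),
  R2_trad = r2_trad(pred, obs)
)
print(round(metrics, 3))
```

The squared correlation ignores a constant bias and stays at 1, while the traditional definition is penalized by it. Reporting RMSE and MAE alongside \(R^2\), as this report does, guards against that blind spot.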
varImpPlot(
  rf_model,
  main = "Random Forest Variable Importance"
)
According to the random forest variable importance analysis, orbital period (pl_orbper) and orbital distance (pl_orbsmax) are the two most important variables for predicting the equilibrium temperature of the exoplanets. Stellar properties, namely stellar radius, stellar temperature, and stellar mass, also rank highly.
This finding agrees with astrophysical expectations, since the equilibrium temperature of a planet depends on the properties of its host star and on its orbital parameters. Planetary mass and orbital eccentricity were comparatively unimportant in the random forest model.
The variable importance values are consistent with our conclusion that nonlinear relationships among these variables matter for predicting equilibrium temperature.
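The astrophysical expectation can be made concrete with the standard equilibrium temperature relation for a circular orbit and full heat redistribution, \(T_{eq} = T_\star \sqrt{R_\star / (2a)}\,(1 - A)^{1/4}\). The sketch below implements it in R using the archive's column names as argument names. The function name, the zero default albedo, and the unit conversion constant are assumptions made for illustration, and the archive's pl_eqt values are not necessarily computed this way.

```r
# One solar radius expressed in astronomical units (695700 km / 1 au).
RSUN_IN_AU <- 0.00465047

# st_teff in K, st_rad in solar radii, pl_orbsmax in au.
# albedo = 0 is an illustrative default, not a value taken from the data.
equilibrium_temp <- function(st_teff, st_rad, pl_orbsmax, albedo = 0) {
  st_teff * sqrt(st_rad * RSUN_IN_AU / (2 * pl_orbsmax)) * (1 - albedo)^0.25
}

# Sun-Earth check: 5772 K star, 1 solar radius, 1 au orbit.
earth_teq <- equilibrium_temp(st_teff = 5772, st_rad = 1, pl_orbsmax = 1)
print(round(earth_teq, 1))

# Taking logs turns the product into a sum, so log(T_eq) is linear in
# log(st_teff), log(st_rad), and log(pl_orbsmax).
a <- c(0.05, 0.5, 5)
teq <- equilibrium_temp(5772, 1, a)
slope <- coef(lm(log(teq) ~ log(a)))[[2]]
print(slope)   # slope of log(T_eq) against log(a)
```

Because the relation is multiplicative, a model that is linear in the raw predictors can only approximate it, whereas a tree ensemble or a linear model on log-transformed variables can follow it more closely. This is one plausible reason for the gap between the linear models and the random forest.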
comparison_table <- data.frame(
  Model = c(
    "Linear Regression",
    "Ridge Regression",
    "Lasso Regression",
    "Random Forest"
  ),
  RMSE = c(
    rmse_lm,
    rmse_ridge,
    rmse_lasso,
    rmse_rf
  ),
  MAE = c(
    mae_lm,
    mae_ridge,
    mae_lasso,
    mae_rf
  ),
  R2 = c(
    r2_lm,
    r2_ridge,
    r2_lasso,
    r2_rf
  )
)
colnames(comparison_table)[4] <- "$R^2$"
knitr::kable(
  comparison_table,
  digits = 3,
  caption = "Model Performance Comparison"
)
| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Linear Regression | 350.006 | 259.251 | 0.427 |
| Ridge Regression | 360.803 | 268.428 | 0.388 |
| Lasso Regression | 365.669 | 272.591 | 0.372 |
| Random Forest | 109.787 | 62.081 | 0.945 |
The comparison shows considerable differences in performance across the models. The multiple linear regression model had moderate predictive power, whereas the ridge and lasso models performed slightly worse on RMSE, MAE, and \(R^2\).
The random forest model performed far better than every other method. It produced the smallest prediction errors and the largest \(R^2\), accounting for about 94% of the variance in the target variable.
These results suggest that nonlinear dependencies among the parameters of exoplanets, their stars, and their orbits strongly influence planetary equilibrium temperature. Although the linear and regularized models detected some associations between these parameters, more flexible machine learning methods appear better suited to this kind of astrophysical problem.
This study examined whether regression and machine learning methods can accurately predict the equilibrium temperature of exoplanets from stellar and orbital characteristics obtained from the NASA Exoplanet Archive. The exploratory analysis found statistically significant associations between equilibrium temperature and several predictors, including stellar temperature, stellar mass, orbital distance, and orbital period.
Four models (multiple linear regression, ridge regression, lasso regression, and random forest regression) were fitted and evaluated using RMSE, MAE, and \(R^2\). The multiple linear regression model achieved moderate predictive accuracy, which indicates that part of the relationship between the predictors and the target variable is linear. The regularized regression models, however, were no more accurate than the baseline linear regression model.
The random forest model was the most effective of the four and clearly outperformed the others on every evaluation metric. It accounted for about 94% of the variance in equilibrium temperature, which indicates that nonlinear dependencies between the predictors and the target variable are important.
These results show that machine learning algorithms can be effective for analyzing astrophysical datasets. They also highlight the role of stellar radiation and orbital configuration in determining the thermal environment of an exoplanet.
This study has several limitations. The dataset contains measurement errors and possible outliers, and the selected variables do not cover every physical factor that affects the equilibrium temperature of a planet. Future studies could test other machine learning algorithms or include additional features to improve model accuracy.
In summary, this study illustrates how statistical methods and machine learning techniques can be applied to astrophysical datasets.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. Springer.
Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.
NASA Exoplanet Archive. Planetary Systems Composite Parameters Table. California Institute of Technology. Retrieved from https://exoplanetarchive.ipac.caltech.edu/
The following appendix contains selected R code used for data preparation, regression modeling, and machine learning analysis.
library(tidyverse)
library(caret)
library(glmnet)
library(randomForest)
library(broom)
library(corrplot)
# Load the archive export; lines starting with "#" are header comments
exoplanets <- read_csv(
  "exoplanets.csv",
  comment = "#"
)
# Keep the response and predictors, then drop rows with missing values
exo_clean <- exoplanets %>%
  select(
    pl_eqt,
    st_teff,
    st_mass,
    st_rad,
    pl_orbsmax,
    pl_orbper,
    pl_rade,
    pl_bmasse,
    pl_orbeccen
  ) %>%
  drop_na()
# 80/20 train/test split
set.seed(621)
train_index <- createDataPartition(
  exo_clean$pl_eqt,
  p = 0.8,
  list = FALSE
)
train_data <- exo_clean[train_index, ]
test_data <- exo_clean[-train_index, ]
y_test <- test_data$pl_eqt
# Baseline multiple linear regression
lm_model <- lm(
  pl_eqt ~ st_teff +
    st_mass +
    st_rad +
    pl_orbsmax +
    pl_orbper +
    pl_rade +
    pl_bmasse +
    pl_orbeccen,
  data = train_data
)
summary(lm_model)
# Random forest, seeded as in the main analysis
set.seed(621)
rf_model <- randomForest(
  pl_eqt ~ .,
  data = train_data,
  importance = TRUE
)
rf_preds <- predict(
  rf_model,
  newdata = test_data
)
rmse_rf <- RMSE(rf_preds, y_test)
mae_rf <- MAE(rf_preds, y_test)
r2_rf <- R2(rf_preds, y_test)