Predicting Exoplanet Equilibrium Temperature Using Regression Modeling
Abstract
This project investigates whether planetary equilibrium temperature can be accurately predicted from the properties of host stars, orbits, and the planets themselves. Using the NASA Exoplanet Archive dataset, it applies regression modeling to relate equilibrium temperature to stellar temperature, stellar mass, stellar radius, orbital distance, orbital period, orbital eccentricity, planet radius, and planet mass. The study follows the full predictive modeling workflow: data cleaning, exploratory data analysis, feature selection, model construction, and model assessment. Multiple linear regression, regularized (ridge and lasso) regression, and random forest regression were fitted and compared using RMSE, MAE, and \(R^2\). The random forest performed best on the test set (\(R^2 \approx 0.94\)), while the linear and regularized models explained roughly 37% to 43% of the variance.
Keywords
Exoplanets; Regression Modeling; Equilibrium Temperature; NASA Exoplanet Archive; Predictive Analytics
Introduction
An exoplanet is a planet that orbits a star beyond our Solar System. Since the first confirmed detections, many exoplanets have been discovered using techniques such as transit photometry, radial velocity observations, direct imaging, and microlensing. These discoveries have expanded our knowledge of planetary systems, yet many questions about their formation, orbits, and habitability remain unanswered.
One quantity used to characterize planets is equilibrium temperature. It estimates the temperature a planet would have if it were heated only by radiation from its host star and re-radiated that energy as a blackbody. Although it does not give the actual surface temperature, which also depends on factors such as the atmosphere, it is useful for identifying planets in temperature ranges that may be considered habitable.
The goal of this research is to construct a model that predicts exoplanet equilibrium temperature from measurable stellar, orbital, and planetary variables. The topic suits regression modeling because the outcome variable is numeric and the NASA Exoplanet Archive provides a range of candidate predictors.
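For reference, the standard textbook expression for equilibrium temperature can be sketched as follows. It rests on simplifying assumptions (a circular orbit, uniform redistribution of heat over the planet, and a Bond albedo \(A\)); the symbols \(T_\star\) (stellar effective temperature), \(R_\star\) (stellar radius), \(a\) (semi-major axis), and \(A\) are introduced here for illustration, and the values tabulated in the archive may rest on different assumptions.

\[
T_{\mathrm{eq}} = T_\star \sqrt{\frac{R_\star}{2a}}\,(1 - A)^{1/4}
\]

Under this expression, equilibrium temperature scales linearly with stellar temperature but as the inverse square root of orbital distance, so a model that is linear in the raw predictors can only approximate it.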
Literature Review
Applications of machine learning and regression analysis are significant for astrophysics research involving exoplanet and planetary system analysis. For example, machine learning and statistical methods have been used to predict certain parameters of planets, classify exoplanets, and identify habitable planets based on observations obtained by astronomical surveys (James et al., 2021).
Regression analysis is usually utilized in establishing the relationship between the characteristics of stars and planets. Regression analysis is effective in analyzing linear relationships and testing predictor significance. However, some astrophysical systems feature non-linear relationships that require other approaches to analyze them (Kuhn & Johnson, 2013).
In more recent times, researchers have attempted machine learning approaches like random forests, artificial neural networks, and support vector machines to analyze exoplanets. Such methods are effective at recognizing complicated nonlinear connections between variables, and usually perform better than conventional regression approaches when analyzing observational data sets. The random forest algorithm is especially useful since it not only identifies nonlinear relationships but also assesses the relative importance of each predictor (Hastie, Tibshirani, & Friedman, 2009).
The current research contributes to the literature by applying both regression and machine learning algorithms to model the continuous equilibrium temperature of an exoplanet. The research differs from other studies which mainly concentrate on classification algorithms for modeling exoplanets. In addition, the study tries to determine whether there are any benefits to be gained by utilizing a nonlinear approach in machine learning algorithms.
Research Question
Which stellar, orbital, and planetary characteristics are most important for predicting exoplanet equilibrium temperature?
Methodology
In this analysis, data on confirmed exoplanets is used, sourced from the NASA Exoplanet Archive. This analysis seeks to predict the equilibrium temperature of an exoplanet based on certain predictor variables that characterize the planet and its star. Regression modeling techniques were used to analyze the relationship between equilibrium temperature and the selected predictor variables.
The data was extracted from the NASA Exoplanet Archive Planetary Systems Composite Parameters table, which contains data on confirmed exoplanet observations and their stellar system characteristics.
Data preparation and cleaning were performed in R. Predictor variables with too many missing values were dropped, and any remaining observations with missing values in the selected variables were removed from the modeling dataset. Variable distributions and correlations were then examined through exploratory data analysis.
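The two cleaning steps described above could look roughly like the sketch below. It uses a small made-up data frame rather than the archive table; the column `sparse_col` and the 50% cutoff `max_missing` are hypothetical choices for illustration, since the threshold actually used is not stated.

```r
# Toy data frame standing in for the archive table (values are made up).
toy <- data.frame(
  pl_eqt     = c(500, 900, NA, 1200, 700),
  st_teff    = c(5000, 5800, 6100, NA, 4500),
  sparse_col = c(NA, NA, NA, 1.2, NA)   # hypothetical column, mostly missing
)

# Share of missing values in each column.
missing_share <- colMeans(is.na(toy))

# Hypothetical cutoff: drop columns with more than half their values missing.
max_missing <- 0.5
kept <- toy[, missing_share <= max_missing, drop = FALSE]

# Keep only the rows that are complete in the remaining columns.
model_data <- kept[complete.cases(kept), , drop = FALSE]
model_data
```

Dropping sparse columns before dropping incomplete rows matters: doing it in the other order would discard nearly every row because of the mostly empty column.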
Dataset Variables
Exoplanet equilibrium temperature (pl_eqt) was used as the response variable for this study. Predictor variables have been chosen according to their anticipated physical relationship with planetary temperature.
Selected predictor variables include:
- Stellar effective temperature (st_teff)
- Stellar mass (st_mass)
- Stellar radius (st_rad)
- Semi-major axis (pl_orbsmax)
- Orbital period (pl_orbper)
- Planet radius (pl_rade)
- Planet mass (pl_bmasse)
- Orbital eccentricity (pl_orbeccen)
Exploratory Data Analysis
Missing data were also assessed in the dataset after the preprocessing stage. No missing values existed in the selected attributes in the final cleaned dataset.
The exoplanet equilibrium temperature distribution is skewed to the right. The majority of the planets have equilibrium temperatures ranging from around 400 K to 1200 K. A few planets have very high equilibrium temperatures, giving the distribution a long right tail. These extreme values are potential outliers and suggest that some planets receive far more radiation from their stars than most planets in the sample.
A number of significant relationships can be found from the correlation matrix. The equilibrium temperature of exoplanets (pl_eqt) is positively correlated with stellar temperature, stellar mass, stellar radius, and planetary radius. Such relationships imply that the higher the stellar temperature, stellar mass, stellar radius, and planetary radius, the higher the equilibrium temperature of planets.
In addition, there exist very high positive correlations between several stellar features, especially stellar temperature, stellar mass, and stellar radius. These relationships imply multicollinearity between predictor variables that might influence the stability of the coefficients in the multiple linear regression models. Thus, regularization methods such as ridge and lasso regression can be used later during model training.
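One common way to quantify multicollinearity of this kind is the variance inflation factor (VIF). The sketch below computes it in base R on simulated data, not on the archive table; the simulated columns borrow the variable names only for readability, and the rule of thumb that values above roughly 5 to 10 signal a problem is a convention rather than a result of this study.

```r
set.seed(1)
n <- 200

# Simulated predictors: stellar radius is built to track stellar mass closely,
# while planet radius is independent of both.
st_mass <- rnorm(n, mean = 1, sd = 0.2)
st_rad  <- st_mass + rnorm(n, sd = 0.05)
pl_rade <- rnorm(n, mean = 5, sd = 2)
toy <- data.frame(st_mass, st_rad, pl_rade)

# VIF for each predictor: regress it on all the others and use 1 / (1 - R^2).
vif <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  fit <- lm(reformulate(others, response = v), data = toy)
  1 / (1 - summary(fit)$r.squared)
})
round(vif, 2)
```

The two linked predictors receive large VIF values, while the independent one stays close to 1.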
Finally, orbital distance (pl_orbsmax) and orbital period (pl_orbper) have a very high positive correlation. This is expected from Kepler's third law, under which the square of the orbital period is proportional to the cube of the semi-major axis for a given stellar mass.
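The link between the two orbital variables can be illustrated with Kepler's third law in solar units. The function below is a sketch that neglects the planet's mass; the function name and arguments are hypothetical, and it assumes the semi-major axis is in au, the stellar mass in solar masses, and the period in days.

```r
# Kepler's third law in solar units: P[yr]^2 = a[au]^3 / M[solar masses],
# neglecting the planet's mass relative to the star's.
orbital_period_days <- function(a_au, m_star_msun) {
  365.25 * sqrt(a_au^3 / m_star_msun)
}

orbital_period_days(1, 1)      # Earth-like orbit around a Sun-like star: 365.25
orbital_period_days(0.05, 1)   # close-in orbit: about 4.08 days
```

Because the period is almost entirely determined by the semi-major axis and stellar mass, the two orbital variables carry largely overlapping information in a regression.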
The scatterplot indicates a positive relationship between stellar temperature and exoplanet equilibrium temperature. Planets orbiting hotter stars generally exhibit higher equilibrium temperatures, which is consistent with the physical expectation that hotter stars emit greater amounts of stellar radiation.
The relationship also displays increasing variability at higher stellar temperatures, suggesting that additional planetary and orbital factors influence equilibrium temperature. Several extreme observations are also visible, indicating the presence of potentially influential outliers within the dataset.
From the log-scale scatter plot, there appears to exist a very clear negative correlation between orbital distance and exoplanet equilibrium temperature. Exoplanets that lie nearer to their stars have considerably high equilibrium temperatures because of the high levels of stellar energy received by such planets. The higher the orbital distance, the lower the equilibrium temperature.
To represent the wide range of orbital distances, a logarithmic scale was used for the orbital distance (semi-major axis) axis. The clear curve formed by the scatter plot indicates that orbital distance will play a crucial role in prediction within the regression model.
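The shape of this relationship can be illustrated on synthetic data. The sketch below assumes the textbook scaling in which equilibrium temperature is proportional to the inverse square root of orbital distance; the constant of 278 K at 1 au is an illustrative value (roughly the zero-albedo figure for a Sun-like star), not a quantity estimated from the dataset.

```r
# Synthetic planets on a grid of orbital distances from 0.01 to 10 au.
a_au <- 10^seq(-2, 1, length.out = 50)

# Assumed scaling: T_eq proportional to a^(-1/2); 278 K at 1 au is illustrative.
t_eq <- 278 * a_au^(-0.5)

# On log-log axes a power law becomes a straight line whose slope is the exponent.
fit <- lm(log10(t_eq) ~ log10(a_au))
slope <- unname(coef(fit)[2])
slope   # -0.5 up to floating point error
```

If the real data follow a similar power law, log-transforming orbital distance (and possibly the response) is one option for letting a linear model represent the relationship more faithfully.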
Regression Modeling
The data were split into a training set and a testing set using an 80/20 split. The training set was used to fit the models, and the testing set was reserved for evaluating them.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 45.591 | 54.670 | 0.834 | 0.404 |
| st_teff | 0.053 | 0.017 | 3.082 | 0.002 |
| st_mass | 426.962 | 64.943 | 6.574 | <0.001 |
| st_rad | 90.816 | 24.327 | 3.733 | <0.001 |
| pl_orbsmax | -254.960 | 15.275 | -16.691 | <0.001 |
| pl_orbper | 0.138 | 0.009 | 15.435 | <0.001 |
| pl_rade | 27.796 | 1.622 | 17.133 | <0.001 |
| pl_bmasse | 0.044 | 0.012 | 3.652 | <0.001 |
| pl_orbeccen | -446.508 | 59.683 | -7.481 | <0.001 |
The multiple linear regression identified several statistically significant predictors of exoplanet equilibrium temperature. Stellar temperature, stellar mass, and stellar radius had positive coefficients, meaning that, holding the other predictors fixed, equilibrium temperature was higher for planets around hotter, more massive, and larger stars.
The coefficient for orbital distance (pl_orbsmax) was strongly negative, consistent with more distant planets receiving less radiation from their host stars and therefore having lower temperatures. Orbital eccentricity also had a negative coefficient. Orbital period (pl_orbper) had a small positive coefficient even though it is highly correlated with orbital distance; opposite signs on two closely related predictors are a common symptom of multicollinearity, so these two coefficients should be interpreted with caution.
Planetary radius and planetary mass both had positive coefficients, implying that, within this sample, larger planets tended to have higher equilibrium temperatures.
All eight predictors were statistically significant at the 0.05 level; only the intercept was not.
| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Linear Regression | 350.006 | 259.251 | 0.427 |
According to the multiple linear regression model, the testing dataset results are moderate in terms of prediction accuracy, with an \(R^2\) score of approximately 0.427, meaning the regression model accounts for about 43% of the variation in the equilibrium temperature of the exoplanets.
The Root Mean Squared Error (RMSE) of about 350 K and the Mean Absolute Error (MAE) of about 259 K show that prediction errors remain substantial. Although the linear regression model captures several of the dependencies in the data, factors or functional forms beyond those represented in the linear model likely contribute to equilibrium temperature.
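The three metrics can be written out by hand as in the sketch below, which uses four made-up observations rather than the study's predictions; the helper names are hypothetical. Two versions of \(R^2\) are shown because they can differ: the squared correlation between predictions and observations (which, as far as can be told from its documentation, is the default in caret's `R2()`) and the traditional one minus the ratio of residual to total sum of squares.

```r
# Hand-written versions of the evaluation metrics.
rmse    <- function(pred, obs) sqrt(mean((obs - pred)^2))
mae     <- function(pred, obs) mean(abs(obs - pred))
r2_corr <- function(pred, obs) cor(pred, obs)^2
r2_trad <- function(pred, obs) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Made-up equilibrium temperatures in kelvin.
obs  <- c(500, 800, 1200, 1500)
pred <- c(600, 700, 1100, 1700)

rmse(pred, obs)      # about 132.29
mae(pred, obs)       # 125
r2_trad(pred, obs)   # about 0.879
r2_corr(pred, obs)   # squared correlation; need not equal r2_trad
```

RMSE penalizes large misses more heavily than MAE, so a gap between the two, as in the results reported here, points to a minority of planets with large errors.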
The comparison of the predicted against the actual shows that the multiple linear regression model explains the overall relationship between the observed values and the predicted equilibrium temperature values. Most observations lie within the vicinity of the reference line, implying that the model has decent prediction capabilities for a majority of the data.
The distance among the observations becomes wider as equilibrium temperatures rise, showing that the model faces some challenges in predicting planets with more extreme temperatures. There are also some observations which lie far away from the reference line, implying that there may be some outliers in the data.
The residual plot shows residuals scattered around zero with no strong systematic trend in their mean, suggesting that the linear regression captures the main relationships between the variables.
Nevertheless, the variance of the residuals increases at higher predicted equilibrium temperatures, indicating possible heteroscedasticity and nonlinear relationships that the model does not capture. There are also several large residuals, meaning that some planetary systems are poorly described by the linear regression.
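A visual impression of heteroscedasticity can be backed up with a simple auxiliary regression. The sketch below is a Breusch-Pagan-style check written in base R on simulated data whose noise grows with the predictor; it is not the diagnostic used in this study, and the variable names are hypothetical.

```r
set.seed(42)
n <- 2000

# Simulated data in which the noise standard deviation grows with x.
x <- runif(n, 1, 10)
y <- 100 + 50 * x + rnorm(n, sd = 10 * x)
fit <- lm(y ~ x)

# Auxiliary regression: do squared residuals depend on the fitted values?
aux <- lm(resid(fit)^2 ~ fitted(fit))

# n * R^2 of the auxiliary regression is compared with a chi-squared
# distribution with 1 degree of freedom (one auxiliary regressor).
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
p_value
```

A very small p-value, as produced here by construction, indicates that residual variance changes with the fitted value.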
Advanced Regression Models
| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Ridge Regression | 360.803 | 268.428 | 0.388 |
The ridge regression model showed predictive accuracy similar to, but slightly lower than, the standard multiple linear regression model on all three metrics (for example, \(R^2\) of 0.388 versus 0.427). This indicates that, despite the multicollinearity among several predictors, shrinking the coefficients did not improve out-of-sample performance.
Even though the predictive accuracy was slightly reduced, ridge regression is still helpful in stabilizing coefficients.
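The stabilizing effect of ridge regression can be seen from its closed-form solution. The sketch below implements that formula directly in base R on simulated, nearly collinear predictors; it is an illustration of the technique rather than the fitting procedure used here, and its `lambda` is not on the same scale as the penalty parameter in glmnet.

```r
set.seed(7)
n <- 100

# Two nearly collinear predictors and a response that depends on both.
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
y  <- 3 * x1 + 2 * x2 + rnorm(n)

X  <- scale(cbind(x1, x2))   # standardize the predictors
yc <- y - mean(y)            # center the response so no intercept is needed

# Ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

b_ols   <- ridge_coef(X, yc, lambda = 0)    # lambda = 0 reproduces least squares
b_ridge <- ridge_coef(X, yc, lambda = 50)   # penalized coefficients
rbind(b_ols, b_ridge)
```

The penalty pulls the coefficient vector toward zero, so the ridge estimates have a smaller overall magnitude and are less sensitive to the near-duplication between the two predictors.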
| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Lasso Regression | 365.669 | 272.591 | 0.372 |
Lasso regression was found to perform less well compared to the baseline regression and ridge regression methods. The poor predictive performance indicates that the process of pushing coefficients toward zero has resulted in the loss of information needed to make accurate predictions.
Despite its weaker predictive performance, lasso regression helps show which features matter most for the dependent variable. Several predictors retain large coefficients, including stellar mass, stellar radius, orbital distance, and planetary radius.
| Variable | Coefficient |
|---|---|
| (Intercept) | 116.587 |
| st_teff | 0.030 |
| st_mass | 485.170 |
| st_rad | 73.219 |
| pl_orbsmax | -61.009 |
| pl_orbper | 0.025 |
| pl_rade | 26.849 |
| pl_bmasse | 0.005 |
| pl_orbeccen | -399.828 |
The lasso model applies coefficient shrinkage to reduce model complexity and can perform automatic feature selection by setting coefficients exactly to zero. In this fit no coefficient was removed entirely, but several predictors kept considerable weight, including stellar mass, stellar radius, orbital distance, planetary radius, and orbital eccentricity, so these variables still contribute substantially to the prediction of equilibrium temperature.
The coefficients for orbital period (pl_orbper) and planetary mass (pl_bmasse) were shrunk close to zero (0.025 and 0.005, compared with 0.138 and 0.044 in the unpenalized model). This implies that these predictors added little predictive power once the other predictors were accounted for. Because the coefficients appear to be reported in the original units of each variable, their magnitudes are not directly comparable across predictors.
From this feature selection process, lasso regression can be deemed useful in finding the significant predictors for regression analysis.
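A cross-validated lasso fit of this kind could be run with glmnet roughly as sketched below. The example uses simulated data with hypothetical predictors `x1` to `x5` rather than the archive variables, and the choice of `lambda.1se` is an assumption, since the penalty value used for the reported coefficients is not stated.

```r
library(glmnet)

set.seed(621)
n <- 200

# Simulated predictors: only x1 and x2 influence the response.
X <- matrix(rnorm(n * 5), ncol = 5,
            dimnames = list(NULL, paste0("x", 1:5)))
y <- 4 * X[, "x1"] - 3 * X[, "x2"] + rnorm(n)

# alpha = 1 selects the lasso penalty (alpha = 0 would give ridge).
cv_lasso <- cv.glmnet(X, y, alpha = 1)

# Coefficients at the more heavily penalized of the two standard lambda choices.
lasso_coefs <- as.matrix(coef(cv_lasso, s = "lambda.1se"))
lasso_coefs

# Number of predictors whose coefficient was set exactly to zero.
n_zero <- sum(lasso_coefs[-1, 1] == 0)
n_zero
```

glmnet standardizes the predictors internally by default and reports coefficients back on the original scale, which is worth keeping in mind when comparing coefficient sizes across variables with different units.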
| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Random Forest | 110.255 | 62.109 | 0.944 |
In comparison with the previous regression models, the random forest regression method has proven superior with regard to all the measures of assessment. This model demonstrated an \(R^2\) value of around 0.944. Therefore, one can say that almost 94% of the variance of equilibrium temperature is captured by the model.
The large gap between the random forest and the linear models suggests that nonlinear dependencies and interactions among stellar, orbital, and planetary properties strongly affect equilibrium temperature. A linear regression that uses the predictors on their original scale cannot capture relationships of this kind.
This result demonstrates the effectiveness of using machine learning models in such cases.
From the random forest variable importance analysis, pl_orbper and pl_orbsmax, indicating orbital period and orbital distance, respectively, are the two most important variables for predicting equilibrium temperature of the exoplanets. The properties of the stars, such as stellar radius, stellar temperature, and stellar mass, are also highly relevant.
This finding aligns well with the general expectations from astrophysics, as the equilibrium temperature of planets depends on the properties of the parent star and the orbital parameters of the planets. Planetary mass and orbital eccentricity were relatively insignificant compared to other variables in the random forest model.
The variable importance values are consistent with the conclusion that orbital and stellar properties, related to equilibrium temperature in a nonlinear way, drive the predictions.
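Variable importance of this kind can be extracted from a fitted randomForest object as sketched below. The example uses simulated data with an assumed inverse-square-root dependence on orbital distance and a deliberately irrelevant column `noise_var`; all values are made up, and only the variable names are borrowed from the study.

```r
library(randomForest)

set.seed(621)
n <- 300

# Simulated predictors; noise_var is a hypothetical column with no signal.
toy <- data.frame(
  pl_orbsmax = 10^runif(n, -2, 0.5),
  st_teff    = runif(n, 3500, 7000),
  noise_var  = rnorm(n)
)

# Assumed nonlinear response plus noise (illustrative constants).
toy$pl_eqt <- 0.05 * toy$st_teff / sqrt(toy$pl_orbsmax) + rnorm(n, sd = 20)

rf_toy <- randomForest(pl_eqt ~ ., data = toy, importance = TRUE)

# type = 1 requests permutation importance (%IncMSE for regression).
imp <- importance(rf_toy, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```

When two predictors are strongly correlated, as orbital period and orbital distance are, permutation importance tends to be shared between them, so their individual rankings should be read with some caution.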
| Model | RMSE | MAE | \(R^2\) |
|---|---|---|---|
| Linear Regression | 350.006 | 259.251 | 0.427 |
| Ridge Regression | 360.803 | 268.428 | 0.388 |
| Lasso Regression | 365.669 | 272.591 | 0.372 |
| Random Forest | 110.255 | 62.109 | 0.944 |
The comparison shows considerable differences in performance among the regression models. The multiple linear regression model demonstrated moderate predictive power, whereas the ridge and lasso regression models performed slightly worse on RMSE, MAE, and \(R^2\).
Of all the tested models, the random forest regression model performed far better than the others. It delivered the smallest prediction errors and the largest \(R^2\), accounting for about 94% of the variance in the target variable.
Thus, the results demonstrate that non-linear dependencies between the parameters of exoplanets, their stars, and orbits significantly influence the equilibrium temperature of planets. Although the linear and regularization methods detected certain associations between exoplanet parameters, machine learning methods appear to perform better in complex astrophysical problems.
Conclusion
The current research explored the ability of regression and machine learning methods to predict equilibrium temperatures for exoplanets from features describing their stellar, orbital, and planetary characteristics, drawn from the NASA Exoplanet Archive. The exploratory analysis found clear associations between the target and several predictors, such as stellar temperature, stellar mass, orbital distance, and orbital period.
Four types of regression models, namely multiple linear, ridge, lasso, and random forest regression, were evaluated using RMSE, MAE, and \(R^2\). The multiple linear regression model showed moderate predictive accuracy (\(R^2 \approx 0.43\)), implying that there are several significant linear dependencies between the target variable and the predictors. However, the regularized regression models were no more accurate than the baseline linear regression model.
The random forest regression model proved to be the most effective among all methods, clearly outperforming other models according to all assessment criteria. This model captured nearly 94% of the variance in equilibrium temperature, meaning that non-linear dependencies between predictors and the target variable are significant.
The results from this study reveal the efficiency of using machine learning algorithms in analyzing astrophysical datasets. This study also highlights the role of stellar radiation and the arrangement of orbits in the environment of an exoplanet.
This study also has several limitations. The dataset contains observational errors and possible outliers, and the selected variables do not cover all the physical factors affecting the equilibrium temperature of a planet. Future studies may test other machine learning algorithms or include additional features to improve model accuracy.
In summary, this study illustrates the applicability of statistical methods and machine learning techniques on astrophysical datasets.
References
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. Springer.
Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.
NASA Exoplanet Archive. Planetary Systems Composite Parameters Table. California Institute of Technology. Retrieved from https://exoplanetarchive.ipac.caltech.edu/
Appendix
The following appendix contains selected R code used for data preparation, regression modeling, and machine learning analysis.
Data Preparation Code
library(tidyverse)
library(caret)
library(glmnet)
library(randomForest)
library(broom)
library(corrplot)
# Read the archive export, skipping the commented header lines
exoplanets <- read_csv(
  "exoplanets.csv",
  comment = "#"
)
# Keep the response and the eight predictors, then drop incomplete rows
exo_clean <- exoplanets %>%
  select(
    pl_eqt,
    st_teff,
    st_mass,
    st_rad,
    pl_orbsmax,
    pl_orbper,
    pl_rade,
    pl_bmasse,
    pl_orbeccen
  ) %>%
  drop_na()
Train/Test Split
# Fix the seed so the split is reproducible
set.seed(621)
# 80/20 split on the response variable
train_index <- createDataPartition(
  exo_clean$pl_eqt,
  p = 0.8,
  list = FALSE
)
train_data <- exo_clean[train_index, ]
test_data <- exo_clean[-train_index, ]
Multiple Linear Regression Code
lm_model <- lm(
  pl_eqt ~ st_teff +
    st_mass +
    st_rad +
    pl_orbsmax +
    pl_orbper +
    pl_rade +
    pl_bmasse +
    pl_orbeccen,
  data = train_data
)
summary(lm_model)
Random Forest Regression Code
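# Fit the random forest on all predictors and record variable importance
rf_model <- randomForest(
  pl_eqt ~ .,
  data = train_data,
  importance = TRUE
)
# Predict on the held-out test set
rf_preds <- predict(
  rf_model,
  newdata = test_data
)
# Observed test-set response (not defined in the original listing)
y_test <- test_data$pl_eqt
# Test-set metrics using the caret helpers
rmse_rf <- RMSE(rf_preds, y_test)
mae_rf <- MAE(rf_preds, y_test)
r2_rf <- R2(rf_preds, y_test)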