Extrasolar planets, or exoplanets for short, are planets that orbit stars other than our own Sun. Current estimates suggest that roughly 70% of stars have at least one planet orbiting them, which means there are a great many potential exoplanets waiting to be discovered and studied. In fact, as of December 10, 2024, there were 5,806 confirmed exoplanets, with many more candidates awaiting further analysis and potential confirmation (NASA Exoplanet Archive). Although identifying exoplanets is a relatively young scientific pursuit, spanning only the past few decades, scientists have already confirmed the existence of thousands of them.
One of the main methods of exoplanet detection is known as the transit method. This method looks for exoplanets through the dimming effect they have on their host star. As an exoplanet orbits its star, it periodically passes directly in front of the star from the observer's point of view, temporarily blocking a small fraction of the star's light until it moves out of the line of sight. This event, a planet passing directly in front of its star, is known as a planetary transit, which is where the transit method gets its name.
There are a couple of important factors that affect how well the transit method works and which exoplanets it is most likely to find. First, the method works best for detecting large exoplanets: a larger planet blocks more of its star's light during a transit than a smaller planet does, so its signal is easier to detect. As a result, the majority of the exoplanets that scientists have detected and confirmed tend to be larger planets. Additionally, the vast majority of detected exoplanets orbit smaller stars. Just as a larger planet causes a more noticeable drop in starlight, a given planet blocks a larger fraction of the light of a small star than of a large one, so the dip in brightness is deeper and easier to spot. Consequently, the exoplanets detected so far tend to orbit small stars.
It takes three observed planetary transits before an exoplanet candidate can be confirmed. Because of this, the vast majority of the exoplanets confirmed so far are planets that orbit close to their stars, since these planets have short orbital periods and therefore transit more often.
For my final project, I looked at a data set which is a record of all of the observed objects of interest in the Kepler Space Observatory’s search for exoplanets.
This data set was found on kaggle.com on the following webpage: https://www.kaggle.com/datasets/nasa/kepler-exoplanet-search-results
I chose the topic of astronomy for my final project because it is something I find very interesting. This data set, which records exoplanet findings and their specific characteristics, appealed to me because it shows the wide variety of discoveries that have been made in the search for exoplanets.
We will read in the data set and call it “exoplanet”.
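A minimal sketch of this import step, assuming the Kaggle file has been downloaded locally as cumulative.csv (the file name and path are assumptions; adjust them to match your download):

# read the Kepler objects-of-interest table into a data frame called "exoplanet"
exoplanet <- read.csv("cumulative.csv", stringsAsFactors = FALSE)
dim(exoplanet)   # expect 9564 rows and 50 columns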
This data set has 9,564 observations of 50 variables.
A key focus of this data set is the differentiation between exoplanet candidates that have been identified by scientists, and exoplanet candidates that have been officially confirmed by the Kepler data analysis.
The variable koi_disposition represents the disposition towards a potential exoplanet finding in the literature, that is, the judgment of scientists regarding a potential exoplanet candidate based upon their findings. This variable can take on the values CANDIDATE, FALSE POSITIVE, NOT DISPOSITIONED, or CONFIRMED.
The variable koi_pdisposition represents the disposition the Kepler data analysis has towards this exoplanet candidate. This variable represents the official findings of the Kepler Space Observatory’s data and whether it confirms the exoplanet candidate that was identified by scientists, or if the finding was a false positive and is not actually an exoplanet. This variable can take on the values of FALSE POSITIVE, NOT DISPOSITIONED, and CANDIDATE.
We will treat this variable as our response variable in future steps where we create supervised learning models. This is a categorical variable, so we will use classification methods in these future steps.
Another key variable for the identification of potential exoplanets is koi_score. This variable is a value between 0 and 1 that indicates the confidence in the KOI disposition of a potential finding. We can use it as the response variable in the supervised learning steps that call for a numeric response, such as the regression models.
Before we begin with conducting some unsupervised and supervised learning procedures on this data set, we will start by doing exploratory data analysis to check the data we have.
Of the variables in the data set, one which looks like an ideal response variable for later steps is koi_pdisposition. This variable represents whether the Kepler data analysis identified an observation as an exoplanet candidate or as a false positive. It is a categorical variable, and it will be a good one to use in the supervised learning steps to see how accurately we can classify an observation as a candidate or a false positive based upon its independent variables.
Let’s make a table to see how many observations fall into each of the potential categories of the koi_pdisposition variable.
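A one-line sketch of how this frequency table can be produced, assuming the data frame is named exoplanet as above:

# count how many observations fall into each Kepler pipeline disposition
table(exoplanet$koi_pdisposition)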
CANDIDATE FALSE POSITIVE
4496 5068
It appears that the koi_pdisposition variable is fairly evenly split: 4,496 observations are candidates and 5,068 are false positives. Since the split is close to even, the variable does not appear to be imbalanced.
One variable which caught my interest is the variable koi_period. This variable represents the orbital period of an observation. This represents the time it takes the potential exoplanet finding to complete one orbit around its star. The measurements for this variable are given in days.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.24 2.73 9.75 75.67 40.72 129995.78
Something appears to be off with the koi_period variable: it has a maximum value of 129,995.78 days, which is an orbital period of about 356.15 years. Exoplanets have not been actively studied for anywhere near that long, and the transit method requires several observed transits, so this value is certainly both an outlier and an error. We will address this outlier in the pre-processing steps, because including it would likely distort our findings regarding this variable.
Another variable whose distribution is interesting to look at is koi_duration. This variable gives the duration, in hours, of the observed planetary transit, measured from when the potential exoplanet first begins to cross in front of the star until it finishes crossing. In other words, it is the total length of time during which the light emitted from the star is partially blocked by the transiting planet.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.052 2.438 3.793 5.622 6.277 138.540
The overall distribution of the koi_duration variable indicates that most of the observed transit durations are relatively short, with the vast majority being between 0 and 10 hours. The shortest transit duration observed was just 0.052 hours, or 3.12 minutes, meaning the planet crossed in front of its star in only a few minutes. This is most likely a planet which orbits very close to its star, so its transits go by very quickly. The median transit duration was 3.793 hours. The maximum transit duration was 138.540 hours, or 5.7725 days. Although this value is much larger than the average, we cannot say it is definitively an outlier, because it is likely just a potential exoplanet that orbits farther from its star and therefore takes longer to complete a transit.
We can also tabulate the koi_disposition variable, the disposition reported in the literature:
CANDIDATE CONFIRMED FALSE POSITIVE
2248 2293 5023
Before we begin with our unsupervised and supervised learning steps, we will ensure that the data is in the best state to do so. We will check for missing values as well as any potential outliers that could lead to issues with the data set.
First, let’s check for missing values in our exoplanet data set.
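One way to tally the missing values per column, sketched under the same naming assumption:

# number of NA values in each column, largest counts first
sort(colSums(is.na(exoplanet)), decreasing = TRUE)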
It turns out that quite a few variables have missing values, so we must address this. Something that immediately stood out to me is that two variables, koi_teq_err1 and koi_teq_err2, each have 9,564 missing values. Since there are exactly 9,564 observations in the data set, these two variables are missing all of their values, and trying to fill them in with a method like multiple imputation would likely produce inaccurate results. So, we will drop these two variables from the data set and then look at the other variables that are missing values.
We will drop the variables koi_teq_err1 and koi_teq_err2 from the data set, because they are missing every value out of all of the observations. We will create a new data set, called “exoplanet0” to store all of the other variables after dropping the koi_teq_err1 and koi_teq_err2 variables.
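A sketch of this dropping step:

# drop the two completely empty error columns, keeping everything else
exoplanet0 <- exoplanet[, !(names(exoplanet) %in% c("koi_teq_err1", "koi_teq_err2"))]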
The remaining variables with missing values mostly seem to have somewhere around 350 to 500 missing values. Given that our data set is so large with 9,564 observations, we can proceed with filling in these missing values through multiple imputation. The variable koi_score has 1,510 missing observations which does stand out as a larger amount, but it does still have the majority of its values, so we can proceed with keeping that variable in our data set.
We will use multiple imputation to fill in the missing values in the data set. We will use the mice function in order to fill in these missing observations.
We will store the data set with the missing values filled in through the multiple imputation process in a new data set called “exoplanet1”. We will convert the results of the multiple imputation from a large mids object back to a regular data frame called “exoplanet1”.
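A sketch of this imputation step with the mice package; the number of imputations, iterations, and seed shown here are illustrative assumptions rather than the report's exact settings:

library(mice)
# impute the remaining missing values and keep one completed data set
imp <- mice(exoplanet0, m = 1, maxit = 5, seed = 123, printFlag = FALSE)
# convert the mids object back to an ordinary data frame
exoplanet1 <- complete(imp, 1)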
Let’s ensure that our new data set “exoplanet1” has no missing values. Since we filled in all of the missing observations through the multiple imputation process, we should get that each variable has zero missing values.
It looks like the multiple imputation filled in nearly all of the variables with missing values. However, the variables koi_period_err2, koi_time0bk_err2, koi_duration_err2, and koi_depth_err2 all still have 454 missing values even after running the multiple imputation function. Since multiple imputation did not work to fill in these particular missing values, we will omit these last remaining missing observations from the data set. We will call the final data set with no more missing observations “exoplanet2”.
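A sketch of removing the rows that still contain missing values:

# drop the rows whose *_err2 values could not be imputed
exoplanet2 <- na.omit(exoplanet1)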
Now, we will make sure that we have no more missing observations.
All of the variables have exactly zero missing observations, so we have successfully addressed the concern of missing observations by either using multiple imputation, or omitting the observations which could not be filled in even with multiple imputation.
In the exploratory data analysis step, we came across a few variables which showed some concerns regarding outliers.
One of the variables which stood out as having a problem was koi_period. This variable represents the orbital period of the potential exoplanet, or the time it takes to complete one orbit around its star. We found that the maximum orbital period in our data set was 129,995.78 days, or 356.15 years. This is certainly an error, because exoplanet studies have not been ongoing for anywhere near 356.15 years. The detection and analysis of exoplanets is a relatively new science, so there is no possible way that the transits of a planet taking 356.15 years to complete one orbit could have been observed and recorded.
We will remove this outlier from our data set, because it is significantly skewing the koi_period variable distribution as it is so much larger than the other observed orbital periods.
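A sketch of removing that single extreme value, assuming we simply drop the row with the maximum orbital period:

# remove the row with the implausible 129,995.78-day orbital period
exoplanet2 <- exoplanet2[-which.max(exoplanet2$koi_period), ]
summary(exoplanet2$koi_period)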
Now we have removed that particular outlier from our data.
Just to be certain we have fixed the problem with this variable, let’s look back at its distribution and summary once again now that the major outlier has been removed.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2418 2.6964 9.3365 60.3207 37.1011 2190.7010
The maximum orbital period is now 2,190.70 days, or about 6 years. While this value is still relatively large in comparison to the rest of the data, and does appear to be an outlier, it is far more believable that a potential exoplanet has an orbital period of around 6 years than the 356-year value we removed. So, this is probably just an unusually large finding rather than something to be concerned about.
Another thing that can be seen from the distribution of the koi_period variable is that it is skewed to the right. This is not a big concern, as it makes sense that the majority of the observations would be closer to zero while a few lie farther out. Most of the exoplanets discovered so far have short orbital periods because they are easier to detect, so it is natural that most observations have short orbital periods while fewer, but still some, have much longer ones.
Another variable whose distribution is worth examining for potential outliers is koi_srad. This represents the radius of the star which the potential exoplanet orbits. These stellar radius measurements are given in units of solar radii; for reference, one solar radius is equal to the radius of our Sun.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.109 0.829 1.000 1.691 1.339 180.013
As we can see from the histogram, most of the host stars of the potential exoplanet findings are relatively small in radius. Most observations have a stellar radius between 0 and 10 solar radii, while fewer stars have significantly larger measurements. In fact, the minimum stellar radius observed was just 0.109 solar radii, meaning this star has only about 11% of the radius of our Sun.
The distribution of the koi_srad variable is skewed to the right, with a few stellar radius measurements far larger than the bulk of the observations. However, this is not much of a practical concern, because the vast majority of detected exoplanets do orbit smaller stars. It is not surprising that most of the stellar radius values are small, because it is much easier to detect exoplanets around small stars: when a planet transits a small star, it blocks a larger fraction of the star's light, so the dip in brightness is more noticeable than it would be for a much larger star.
Another variable whose distribution is worth checking for significant skew is koi_score. This variable is a value between 0 and 1 that indicates the confidence in the koi_disposition; in other words, it reflects how confident the disposition for a particular observation is.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.2460 0.4646 0.9970 1.0000
The distribution of the variable koi_score appears to be bimodal, with peaks at both 0 and 1. Although this is certainly not a normal distribution, it can be explained by the nature of how this score is assigned. For candidates, a higher score indicates more confidence in the classification, and a lower score indicates less confidence. For false positives, however, a higher value indicates less confidence, while a lower score indicates more confidence. Since our data are fairly evenly split between candidates and false positives, this difference in the meaning of the KOI score is likely what produces the bimodal appearance.
Overall, it looks like we have addressed the major concerns for the data set, such as missing values and outliers, with these pre-processing steps, so now we can move onto the unsupervised and supervised learning steps.
Now, we will begin with doing some unsupervised learning steps on our data set.
We will start with running principal component analysis, PCA. To run PCA, we will create a subset of the data set containing just the numeric variables, omitting any character variables.
We will call this subset of numeric variables “exoplanet.num”. All of the numeric variables in the data set have two corresponding error variables, so to keep things concise we will look only at the main numeric variables, not their error terms.
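A sketch of building this numeric subset; the column list below is inferred from the variables that appear later in the regression output, so the exact selection is an assumption:

# the 15 main numeric measurement columns (no *_err1 / *_err2 terms)
num.vars <- c("koi_score", "koi_period", "koi_impact", "koi_duration", "koi_depth",
              "koi_prad", "koi_teq", "koi_insol", "koi_model_snr", "koi_steff",
              "koi_slogg", "koi_srad", "ra", "dec", "koi_kepmag")
exoplanet.num <- exoplanet2[, num.vars]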
We will begin with running PCA on the data. Running PCA will allow us to determine how much of the variation in the data set can be explained by the principal components. We will run the PCA without any scaling for now.
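A sketch of the unscaled PCA call, assuming the subset above:

# PCA on the raw (unscaled) numeric variables
pca.unscaled <- prcomp(exoplanet.num, scale. = FALSE)
summary(pca.unscaled)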
Importance of components:
PC1 PC2 PC3 PC4 PC5
Standard deviation 1.576e+05 8.250e+04 3092.9481 881.60621 676.18794
Proportion of Variance 7.845e-01 2.151e-01 0.0003 0.00002 0.00001
Cumulative Proportion 7.845e-01 9.997e-01 1.0000 0.99997 0.99999
PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13
Standard deviation 628.34709 112.8 5.936 4.693 4.282 3.591 2.46 1.221
Proportion of Variance 0.00001 0.0 0.000 0.000 0.000 0.000 0.00 0.000
Cumulative Proportion 1.00000 1.0 1.000 1.000 1.000 1.000 1.00 1.000
PC14 PC15
Standard deviation 0.4015 0.2529
Proportion of Variance 0.0000 0.0000
Cumulative Proportion 1.0000 1.0000
The first principal component, PC1, accounts for 78.45% of the total variation. The second principal component, PC2, accounts for an additional 21.51% of the total variation. So, the first two principal components account for 99.97% of the total variation in the data set.
We will take a look and see which variable had the highest absolute value of their loading value for PC1.
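A sketch of how the PC1 loadings can be inspected, assuming the prcomp object from the sketch above:

# PC1 loadings ordered by absolute size; koi_insol should dominate
sort(abs(pca.unscaled$rotation[, 1]), decreasing = TRUE)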
The variable with the highest absolute loading value for PC1 is koi_insol, with a loading of 0.99999. All of the other variables have loadings very close to zero. This variable has the largest loading because it relates most strongly to the first principal component, PC1; since the PCA was run without scaling, this also means that koi_insol has by far the largest variance of the numeric variables.
Now, we will run PCA again, but this time with scaling the variables.
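A sketch of the scaled PCA call:

# PCA with each variable centered and scaled to unit variance
pca.scaled <- prcomp(exoplanet.num, scale. = TRUE)
summary(pca.scaled)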
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.6528 1.3764 1.3216 1.2241 1.12423 1.03913 0.99454
Proportion of Variance 0.1821 0.1263 0.1164 0.0999 0.08426 0.07199 0.06594
Cumulative Proportion 0.1821 0.3084 0.4249 0.5248 0.60903 0.68101 0.74695
PC8 PC9 PC10 PC11 PC12 PC13 PC14
Standard deviation 0.89051 0.8316 0.79762 0.71240 0.62066 0.56434 0.5253
Proportion of Variance 0.05287 0.0461 0.04241 0.03383 0.02568 0.02123 0.0184
Cumulative Proportion 0.79982 0.8459 0.88833 0.92217 0.94785 0.96908 0.9875
PC15
Standard deviation 0.43340
Proportion of Variance 0.01252
Cumulative Proportion 1.00000
With scaling the variables, the proportion of the total variation explained by the first principal component, PC1, is 18.21%. The second principal component, PC2, accounts for an additional 12.63% of the total variation. So, the first two principal components account for 30.84% of the total variation.
We will see which variables had the highest absolute value of their loading values for the first two principal components with scaling the variables.
The variable with the highest absolute value of its loading value for PC1 is koi_slogg with a loading value of -0.4938. The variable with the second highest absolute value of its loading value for PC1 is koi_teq with a loading value of 0.46608.
The variable with the highest absolute value of its loading value for PC2 is koi_model_snr with a loading value of -0.4727. The variable with the second highest absolute value of its loading value for PC2 is koi_depth with a loading value of -0.4635.
Now, we will move on to some supervised learning steps for our data set.
We will create a new data set which includes all of the numeric variables we looked at during the PCA, with the categorical variable koi_pdisposition added back in, since we will use it later for a classification step. We will call this data set “exoplanet3”.
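A sketch of assembling this modeling data set, reusing the numeric subset from the PCA step:

# numeric variables plus the categorical response for later classification
exoplanet3 <- cbind(exoplanet.num, koi_pdisposition = factor(exoplanet2$koi_pdisposition))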
We will begin by creating a linear regression model and fitting it to our data. We will need a numeric response variable for linear regression, and among the numeric variables, a good choice is koi_score. This variable represents the confidence in the resulting disposition of an observation. So, we can create an OLS regression model to see whether we can statistically significantly predict the koi_score of an observation based upon its various factors.
Let’s begin with making the OLS regression model.
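A sketch of fitting the model, matching the Call line shown in the output below:

ols.fit <- lm(koi_score ~ ., data = exoplanet3)
summary(ols.fit)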
Call:
lm(formula = koi_score ~ ., data = exoplanet3)
Residuals:
Min 1Q Median 3Q Max
-0.94655 -0.04978 0.01003 0.06596 0.98700
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.691e+00 9.904e-02 17.072 < 2e-16 ***
koi_period -3.153e-04 1.461e-05 -21.581 < 2e-16 ***
koi_impact 1.415e-03 6.112e-04 2.315 0.020662 *
koi_duration -7.117e-04 2.637e-04 -2.699 0.006973 **
koi_depth -9.859e-08 2.293e-08 -4.300 1.72e-05 ***
koi_prad -6.969e-07 6.637e-07 -1.050 0.293760
koi_teq -2.945e-05 2.903e-06 -10.146 < 2e-16 ***
koi_insol 4.660e-08 1.219e-08 3.822 0.000133 ***
koi_model_snr -1.116e-05 2.372e-06 -4.706 2.57e-06 ***
koi_steff -1.306e-06 2.204e-06 -0.593 0.553461
koi_slogg -2.582e-02 5.884e-03 -4.389 1.15e-05 ***
koi_srad -8.271e-04 4.278e-04 -1.933 0.053233 .
ra -1.967e-03 3.219e-04 -6.111 1.03e-09 ***
dec 4.663e-04 4.186e-04 1.114 0.265283
koi_kepmag -3.528e-03 1.339e-03 -2.635 0.008419 **
koi_pdispositionFALSE POSITIVE -8.600e-01 3.448e-03 -249.431 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1431 on 9093 degrees of freedom
Multiple R-squared: 0.9069, Adjusted R-squared: 0.9067
F-statistic: 5903 on 15 and 9093 DF, p-value: < 2.2e-16
This OLS regression model, with koi_score as the response variable, is statistically significant overall, with a p-value of p < .001. This means that the independent variables, taken together, statistically significantly predict the koi_score of an observation; the adjusted R-squared of 0.9067 indicates that the model explains about 91% of the variation in koi_score.
We will interpret the regression coefficients of the variables koi_period, koi_duration, and koi_pdispositionFALSEPOSITIVE.
The variable koi_period has a regression coefficient of -0.000315. This means that for every 1 day increase in the orbital period of an observation, its koi_score decreases by 0.000315 units, holding all the other variables constant. There is a negative relationship between koi_period and koi_score.
The variable koi_duration has a regression coefficient of -0.000712. This means that for every 1 hour increase in the transit duration of an observation, its koi_score decreases by 0.000712 units, holding all the other variables constant. There is a negative relationship between koi_duration and koi_score.
The variable koi_pdisposition is a categorical variable, so it has been converted to dummy variables. There were two categories included in the data set, CANDIDATE and FALSE POSITIVE. CANDIDATE was chosen as the base level, because it comes first alphabetically. So, our interpretation for the regression coefficient of the dummy variable, koi_pdispositionFALSEPOSITIVE, will be in relation to the base level.
The variable koi_pdispositionFALSEPOSITIVE has a regression coefficient of -0.8600. This means that the koi_score of an observation is, on average, 0.8600 units lower for false positives than for candidates, holding all the other variables constant.
We will now estimate the test RMSE of this OLS regression model using 5-fold cross-validation.
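A sketch of this cross-validation step using the caret package; the seed is an illustrative assumption:

library(caret)
set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)
ols.cv <- train(koi_score ~ ., data = exoplanet3, method = "lm", trControl = ctrl)
ols.cv$results$RMSE   # cross-validated RMSE (about 0.1437 in the report)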
The OLS regression model gives a test RMSE value of 0.1437. The test RMSE represents the typical difference between the model's predicted values and the actual values, so on average the predictions made by our OLS model were off by 0.1437 units of koi_score.
We will use a LASSO model instead of OLS regression. We will calculate the test RMSE of this LASSO model, and see if it did better or worse than the OLS regression model did. We will center and scale the variables.
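A sketch of the LASSO fit via caret and glmnet, reusing the ctrl object from above; the lambda grid shown is an assumption, chosen so that it contains the reported optimum of 0.01:

set.seed(123)
lasso.cv <- train(koi_score ~ ., data = exoplanet3, method = "glmnet",
                  preProcess = c("center", "scale"),
                  tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 0.1, by = 0.001)),
                  trControl = ctrl)
lasso.cv$bestTune               # selected lambda
min(lasso.cv$results$RMSE)      # cross-validated RMSE at that lambda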
The optimal value of lambda that was chosen for this LASSO model is lambda = 0.01. The test RMSE value for this lambda is 0.1453. This test RMSE value means that the estimated difference between the predicted and the actual values found by this LASSO model is 0.1453. So, on average the predictions made by our LASSO model were off by 0.1453 units of koi_score from their actual values.
We did worse with this LASSO model than we did with the OLS regression model. The test RMSE of the LASSO model was 0.1453, which was more than the test RMSE of the OLS regression model, which was 0.1437. This means there was more difference between the actual and the predicted values of the LASSO model than there were for the OLS regression model.
We will make a pruned tree for our regression model. We will use 10 possible tuning parameter values to prune the tree.
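A sketch of the pruned regression tree via caret's rpart method, letting train() consider 10 complexity-parameter values:

set.seed(123)
reg.tree <- train(koi_score ~ ., data = exoplanet3, method = "rpart",
                  tuneLength = 10, trControl = ctrl)
reg.tree$bestTune               # selected cp
min(reg.tree$results$RMSE)      # cross-validated RMSE for the pruned tree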
The optimal model occurs using a cp = 0.000947. This results in a test RMSE value of 0.1233 for the pruned tree. This means that the predictions made by this pruned tree were off by 0.1233 units of koi_score from their actual values. This pruned tree did the best of the models we have looked at so far because it has the lowest test RMSE.
Now, let’s plot the pruned tree.
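A sketch of plotting the final pruned tree, assuming the rpart.plot package is available:

library(rpart.plot)
rpart.plot(reg.tree$finalModel)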
We can see that the first split of the pruned tree was made on koi_pdispositionFALSEPOSITIVE >= 0.5.
Out of the models we looked at, the pruned tree model did the best, because it resulted in the lowest test RMSE. This means that this model had the smallest difference between the actual and the predicted values.
Now, we will use classification to create a model with a categorical response variable. We will use koi_pdisposition as our response variable, and we will see how accurately we can classify potential exoplanet findings.
First, we will change the level “FALSE POSITIVE” to “FALSEPOSITIVE” to avoid an error that occurred when the level name contained a space.
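A sketch of this recoding step on the factor levels:

# rename the level so that it no longer contains a space
levels(exoplanet3$koi_pdisposition)[levels(exoplanet3$koi_pdisposition) == "FALSE POSITIVE"] <- "FALSEPOSITIVE"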
We will run KNN with k = 1, …, 11. We will find what the optimal value of k is, and what the test accuracy is at this optimal value. We will center and scale the variables to ensure they are on the same scale.
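A sketch of the KNN tuning via caret, with centering and scaling, reusing the ctrl object from above:

set.seed(123)
knn.cv <- train(koi_pdisposition ~ ., data = exoplanet3, method = "knn",
                preProcess = c("center", "scale"),
                tuneGrid = data.frame(k = 1:11),
                trControl = ctrl)
knn.cv$bestTune                  # optimal k (k = 7 in the report)
max(knn.cv$results$Accuracy)     # cross-validated accuracy at that k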
The optimal value of k is k = 7, and this resulted in an accuracy of 0.9649. This value represents the accuracy of our model, meaning the proportion of times which the model correctly classified the observations. So, this KNN model has an accuracy of 0.9649, or 96.49%. This means that we correctly classified whether an observation was a candidate or a false positive 96.49% of the time.
We will now fit a classification tree. We will let the train() function pick 10 different values of the tuning parameter, and we will plot the final tree.
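A sketch of the classification tree fit, again letting train() choose among 10 cp values:

set.seed(123)
class.tree <- train(koi_pdisposition ~ ., data = exoplanet3, method = "rpart",
                    tuneLength = 10, trControl = ctrl)
class.tree$bestTune              # selected cp
max(class.tree$results$Accuracy)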
The optimal model for the classification tree used a cp = 0.000528. The accuracy for this optimal model is 0.9719, or 97.19%. This means that this classification tree correctly classified whether the observation is a candidate or a false positive 97.19% of the time.
Now, let’s plot the classification tree.
The first split in the classification tree was at koi_score >= 0.411.
We will run a random forest model with m = 1, …, 5 using the out-of-bag (OOB) approach. We will find the optimal value of m and the test accuracy at that value.
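A sketch of the random forest tuned over mtry = 1, …, 5 with the out-of-bag error estimate:

set.seed(123)
rf.oob <- train(koi_pdisposition ~ ., data = exoplanet3, method = "rf",
                tuneGrid = data.frame(mtry = 1:5),
                trControl = trainControl(method = "oob"))
rf.oob$bestTune                  # optimal mtry (m = 5 in the report)
max(rf.oob$results$Accuracy)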
For this random forest model with the out of bag approach, the optimal value of m is m = 5. This results in an accuracy of 0.9745, or 97.45%. This means that this model correctly classified whether an observation was a candidate or a false positive 97.45% of the time.
This out of bag model did the best out of all of the classification models we tried, because it has the highest accuracy percentage.
Let’s see what the most important variable was.
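A sketch of extracting variable importance from the random forest fit:

# scaled importance scores; koi_score should appear at 100
varImp(rf.oob)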
The most important variable to the out of bag model was koi_score with an importance of 100.0000. This was also the variable which was used to make the first split in the classification tree. So, koi_score was the most important variable in classifying whether an observation was an exoplanet candidate or if it was a false positive.
In this project, we looked at a data set of exoplanet sightings from the Kepler Space Observatory. The data set records various factors about each potential exoplanet sighting, such as its orbital period and transit duration, along with whether the observation was identified as an exoplanet candidate or was actually just a false positive.
We conducted both unsupervised and supervised learning on this data set. In the unsupervised learning, we ran PCA on the numeric variables. Without scaling the variables, the variable koi_insol had the highest absolute loading value for PC1, meaning it related most strongly to the first principal component. We then ran PCA with the variables scaled and found that koi_slogg and koi_teq had the highest absolute loading values for PC1, while koi_model_snr and koi_depth had the highest absolute loading values for PC2.
In the supervised learning step, we looked at two main models: an OLS regression model using koi_score as the response variable, and a classification model using koi_pdisposition as the response variable. First, we created an OLS regression model with koi_score as the response variable and found that it was statistically significant with a p-value of p < .001, so the model did statistically significantly predict an observation's koi_score. We then compared an OLS model, a LASSO model, and a pruned tree model to see which gave the best performance and lowest test RMSE. The pruned tree had the lowest test RMSE, meaning it performed best because it produced the smallest difference between the actual and the predicted values.
During the supervised learning steps, we also created a classification model with the categorical variable koi_pdisposition as the response variable. We created several models to see which one provided the best accuracy. We looked at KNN with scaling, a classification tree, and an out of bag model. We found that the out of bag model did the best, because it resulted in the highest accuracy percentage. This meant that we correctly classified whether an observation was an exoplanet candidate or a false positive with the highest accuracy by using the out of bag model.
Overall, this project showed that we can create models which predict whether or not an observation is an exoplanet candidate, or just a false positive, with very high accuracy. The out of bag classification model had an accuracy of 97.45%, meaning we created a model which has very good performance in correctly classifying true exoplanet candidates.
This provides useful information, because it can help to determine which observations truly are exoplanet candidates that are worth investigating further to find other planets beyond our solar system.
Some recommendations I would give for future projects include:
Look into potential false positives and false negatives in the classification steps. For instance, perhaps there were some observations classified as exoplanet candidates that later turned out to be false. Or maybe there were some observations labelled as false sightings that later turned out to be true exoplanet candidates.
Consider other models which may provide even better performance in terms of test RMSE and accuracy. We looked at quite a few models in this project both to look into test RMSE and accuracy. However, perhaps there are other potential models which could be created that would provide even better performance than the ones we created.
NASA. (2017, October 10). Kepler exoplanet search results [Data set]. Kaggle. https://www.kaggle.com/datasets/nasa/kepler-exoplanet-search-results
NASA. (n.d.). What's a transit? NASA Science. https://science.nasa.gov/exoplanets/whats-a-transit/
NASA Exoplanet Archive. (n.d.). NASA Exoplanet Archive. https://exoplanetarchive.ipac.caltech.edu/