Exoplanets

In this visualization/regression project, I will attempt to correlate relationships between aspects of planets and stars.

I am using the Open Exoplanet Catalogue from kaggle.com: https://www.kaggle.com/mrisdal/open-exoplanet-catalogue

Variable definitions:

PlanetaryMassJpt: Mass of the planet (measured in Jupiter masses)

SurfaceTempK: Surface Temperature of the Planet (in degrees Kelvin)

DiscoveryMethod: Method by which planet was discovered

HostStarMassSlrMass: Mass of the star (measured in solar masses)

HostStarMetallicity: Measure of the abundance of heavy elements present inside the star (measured in Fe/H)

If Fe/H=0, the star has the same abundance of iron as the Sun. If Fe/H = -1, it has 1/10th of the sun’s value and vice versa.

HostStarTempK: Temperature of the Star (measured in degrees Kelvin)

SemiMajorAxisAU: The major axis of an orbit at its longest diameter (measured in AU(distance from Earth to Sun))

Load necessary packages

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(readr)
library(ggplot2)
library(dplyr)
library(ggthemes)
library(ggrepel)
library(rgl)
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(knitr)

Link .csv file in directory to R Studio

setwd("C:/Users/tycho/Desktop/DATA101")
exoplanets <- read_csv("oec.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   PlanetIdentifier = col_character(),
##   AgeGyr = col_logical(),
##   DiscoveryMethod = col_character(),
##   LastUpdated = col_character(),
##   RightAscension = col_character(),
##   Declination = col_character(),
##   ListsPlanetIsOn = col_character()
## )
## i Use `spec()` for the full column specifications.
## Warning: 2 parsing failures.
##  row    col           expected actual      file
## 1681 AgeGyr 1/0/T/F/TRUE/FALSE 0.0055 'oec.csv'
## 2916 AgeGyr 1/0/T/F/TRUE/FALSE 3      'oec.csv'
head(exoplanets)
## # A tibble: 6 x 25
##   PlanetIdentifier TypeFlag PlanetaryMassJpt RadiusJpt PeriodDays
##   <chr>               <dbl>            <dbl>     <dbl>      <dbl>
## 1 HD 143761 b             0           1.04      NA         39.8  
## 2 HD 143761 c             0           0.079     NA        103.   
## 3 KOI-1843.03             0           0.0014     0.054      0.177
## 4 KOI-1843.01             0          NA          0.114      4.19 
## 5 KOI-1843.02             0          NA          0.071      6.36 
## 6 Kepler-9 b              0           0.25       0.84      19.2  
## # ... with 20 more variables: SemiMajorAxisAU <dbl>, Eccentricity <dbl>,
## #   PeriastronDeg <dbl>, LongitudeDeg <dbl>, AscendingNodeDeg <dbl>,
## #   InclinationDeg <dbl>, SurfaceTempK <dbl>, AgeGyr <lgl>,
## #   DiscoveryMethod <chr>, DiscoveryYear <dbl>, LastUpdated <chr>,
## #   RightAscension <chr>, Declination <chr>, DistFromSunParsec <dbl>,
## #   HostStarMassSlrMass <dbl>, HostStarRadiusSlrRad <dbl>,
## #   HostStarMetallicity <dbl>, HostStarTempK <dbl>, HostStarAgeGyr <dbl>,
## #   ListsPlanetIsOn <chr>
# View & load the data
view(exoplanets)
exoplanets
## # A tibble: 3,584 x 25
##    PlanetIdentifier TypeFlag PlanetaryMassJpt RadiusJpt PeriodDays
##    <chr>               <dbl>            <dbl>     <dbl>      <dbl>
##  1 HD 143761 b             0           1.04      NA         39.8  
##  2 HD 143761 c             0           0.079     NA        103.   
##  3 KOI-1843.03             0           0.0014     0.054      0.177
##  4 KOI-1843.01             0          NA          0.114      4.19 
##  5 KOI-1843.02             0          NA          0.071      6.36 
##  6 Kepler-9 b              0           0.25       0.84      19.2  
##  7 Kepler-9 c              0           0.17       0.82      39.0  
##  8 Kepler-9 d              0           0.022      0.147      1.59 
##  9 GJ 160.2 b              0           0.0321    NA          5.24 
## 10 Kepler-566 b            0          NA          0.192     18.4  
## # ... with 3,574 more rows, and 20 more variables: SemiMajorAxisAU <dbl>,
## #   Eccentricity <dbl>, PeriastronDeg <dbl>, LongitudeDeg <dbl>,
## #   AscendingNodeDeg <dbl>, InclinationDeg <dbl>, SurfaceTempK <dbl>,
## #   AgeGyr <lgl>, DiscoveryMethod <chr>, DiscoveryYear <dbl>,
## #   LastUpdated <chr>, RightAscension <chr>, Declination <chr>,
## #   DistFromSunParsec <dbl>, HostStarMassSlrMass <dbl>,
## #   HostStarRadiusSlrRad <dbl>, HostStarMetallicity <dbl>, HostStarTempK <dbl>,
## #   HostStarAgeGyr <dbl>, ListsPlanetIsOn <chr>

Examine the dataframe structure.

str(exoplanets)
## spec_tbl_df [3,584 x 25] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ PlanetIdentifier    : chr [1:3584] "HD 143761 b" "HD 143761 c" "KOI-1843.03" "KOI-1843.01" ...
##  $ TypeFlag            : num [1:3584] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PlanetaryMassJpt    : num [1:3584] 1.045 0.079 0.0014 NA NA ...
##  $ RadiusJpt           : num [1:3584] NA NA 0.054 0.114 0.071 0.84 0.82 0.147 NA 0.192 ...
##  $ PeriodDays          : num [1:3584] 39.846 102.54 0.177 4.195 6.356 ...
##  $ SemiMajorAxisAU     : num [1:3584] 0.2196 0.4123 0.0048 0.039 0.052 ...
##  $ Eccentricity        : num [1:3584] 0.037 0.05 NA NA NA 0.0626 0.0684 NA 0.06 NA ...
##  $ PeriastronDeg       : num [1:3584] 271 190 NA NA NA ...
##  $ LongitudeDeg        : num [1:3584] NA NA NA NA NA NA NA NA NA NA ...
##  $ AscendingNodeDeg    : num [1:3584] NA NA NA NA NA NA NA NA NA NA ...
##  $ InclinationDeg      : num [1:3584] NA NA 72 89.4 88.2 ...
##  $ SurfaceTempK        : num [1:3584] NA NA NA NA NA ...
##  $ AgeGyr              : logi [1:3584] NA NA NA NA NA NA ...
##  $ DiscoveryMethod     : chr [1:3584] "RV" "RV" "transit" "transit" ...
##  $ DiscoveryYear       : num [1:3584] 2016 2016 2012 NA NA ...
##  $ LastUpdated         : chr [1:3584] "16/07/11" "16/07/11" "13/07/15" NA ...
##  $ RightAscension      : chr [1:3584] "16 01 03" "16 01 03" "19 00 03.14" "19 00 03.14" ...
##  $ Declination         : chr [1:3584] "+33 18 13" "+33 18 13" "+40 13 14.7" "+40 13 14.7" ...
##  $ DistFromSunParsec   : num [1:3584] 17.2 17.2 NA NA NA ...
##  $ HostStarMassSlrMass : num [1:3584] 0.889 0.889 0.46 0.46 0.46 1.07 1.07 1.07 0.69 0.83 ...
##  $ HostStarRadiusSlrRad: num [1:3584] 1.36 1.36 0.45 0.45 0.45 ...
##  $ HostStarMetallicity : num [1:3584] -0.31 -0.31 0 0 0 0.12 0.12 0.12 NA -0.01 ...
##  $ HostStarTempK       : num [1:3584] 5627 5627 3584 3584 3584 ...
##  $ HostStarAgeGyr      : num [1:3584] NA NA NA NA NA NA NA NA NA NA ...
##  $ ListsPlanetIsOn     : chr [1:3584] "Confirmed planets" "Confirmed planets" "Controversial" "Controversial" ...
##  - attr(*, "problems")= tibble [2 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ row     : int [1:2] 1681 2916
##   ..$ col     : chr [1:2] "AgeGyr" "AgeGyr"
##   ..$ expected: chr [1:2] "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE"
##   ..$ actual  : chr [1:2] "0.0055" "3"
##   ..$ file    : chr [1:2] "'oec.csv'" "'oec.csv'"
##  - attr(*, "spec")=
##   .. cols(
##   ..   PlanetIdentifier = col_character(),
##   ..   TypeFlag = col_double(),
##   ..   PlanetaryMassJpt = col_double(),
##   ..   RadiusJpt = col_double(),
##   ..   PeriodDays = col_double(),
##   ..   SemiMajorAxisAU = col_double(),
##   ..   Eccentricity = col_double(),
##   ..   PeriastronDeg = col_double(),
##   ..   LongitudeDeg = col_double(),
##   ..   AscendingNodeDeg = col_double(),
##   ..   InclinationDeg = col_double(),
##   ..   SurfaceTempK = col_double(),
##   ..   AgeGyr = col_logical(),
##   ..   DiscoveryMethod = col_character(),
##   ..   DiscoveryYear = col_double(),
##   ..   LastUpdated = col_character(),
##   ..   RightAscension = col_character(),
##   ..   Declination = col_character(),
##   ..   DistFromSunParsec = col_double(),
##   ..   HostStarMassSlrMass = col_double(),
##   ..   HostStarRadiusSlrRad = col_double(),
##   ..   HostStarMetallicity = col_double(),
##   ..   HostStarTempK = col_double(),
##   ..   HostStarAgeGyr = col_double(),
##   ..   ListsPlanetIsOn = col_character()
##   .. )

Temperature Box plots for Host Stars and Exoplanets. Box plots are useful when we want to look at the overall distribution for a variable. In particular, they are useful to find outliers which we can discard to get a better picture of future data visualizations.

  boxplot1 <- ggplot(exoplanets) +
  geom_boxplot(aes(x = HostStarTempK)) +
  xlab("Star Temperature (degrees K)") +
  ggtitle("Star Temperature Distribution")


  boxplot2 <- ggplot(exoplanets) +
  geom_boxplot(aes(x = SurfaceTempK)) +
  xlab("Exoplanet Temperature (in degrees K)") +
  ggtitle("Exoplanet Temperature Distribution")
  
  grid.arrange(boxplot1, boxplot2, ncol = 2)
## Warning: Removed 129 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2843 rows containing non-finite values (stat_boxplot).

Mass Box plots for Host Stars and Exoplanets

# Star Mass Box plot
  boxplot3 <- ggplot(exoplanets) +
  geom_boxplot(aes(x = HostStarMassSlrMass)) +
  xlab("Star Mass (in Solar Masses)") +
  ggtitle("Star Mass Distribution")

# Exoplanet Mass Box plot
  boxplot4 <-  ggplot(exoplanets) +
  geom_boxplot(aes(x = PlanetaryMassJpt)) +
  xlab("Exoplanet Mass (in Jupiter Masses)") +
  ggtitle("Exoplanet Mass Distribution")

grid.arrange(boxplot3, boxplot4, ncol = 2)
## Warning: Removed 168 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2271 rows containing non-finite values (stat_boxplot).

Here, we compare star mass and planet mass and try to see if we can find a correlation. For my legend, I used DiscoveryMethod as I thought it would be interesting to see where on the graph planets are located based on how it was discovered. For example, direct imaging has discovered mostly higher mass planets orbiting low mass stars.

https://earthsky.org/space/how-do-astronomers-discover-exoplanets/

mass_scatter <- ggplot(exoplanets, aes(x = PlanetaryMassJpt, y = HostStarMassSlrMass,)) + 
  ggtitle("Planet Mass vs Star Mass") + 
  xlab("Planet Mass (in Jupiter Masses)") + 
  ylab("Star Mass (in Solar Masses") + 
  geom_point(aes(color=DiscoveryMethod), size = 0.5, alpha = 0.5)
ggplotly(mass_scatter)

There are a number of outliers which makes it hard to see trends in the data. Here, we only include planets with 1/4th Jupiter Mass or smaller.

e_small <- exoplanets %>% 
  filter(PlanetaryMassJpt<0.25)
  
e_small %>%  
  
  ggplot(aes(x = PlanetaryMassJpt, y = HostStarMassSlrMass)) + 
  ggtitle("Planet Mass vs Star Mass") + 
  xlab("Planet Mass (in Jupiter Masses)") + 
  ylab("Star Mass (in Solar Masses") + 
  geom_point(aes(color=DiscoveryMethod), size = 1, alpha = 0.5) + 
  geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 43 rows containing non-finite values (stat_smooth).
## Warning: Removed 43 rows containing missing values (geom_point).

The chart shows a slight positive trend: higher mass stars tend to have higher mass planets orbiting them vs lower mass stars.

Now lets compare Semi-major axis with Surface temperature.

e_tempdist <- exoplanets %>% 
  
  filter(SemiMajorAxisAU<10)
  
e_tempdist %>%  
  
  ggplot(aes(x = SurfaceTempK, y = SemiMajorAxisAU)) + 
  ggtitle("Planet Temperature vs Distance from Star") + 
  xlab("Planet Temperature (in degrees K)") + 
  ylab("Semi-major Axis (in AU)") + 
  geom_point(aes(color=DiscoveryMethod), size = 1, alpha = 0.5) + 
  geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 679 rows containing non-finite values (stat_smooth).
## Warning: Removed 679 rows containing missing values (geom_point).

From this, we can see a negative correlation between Temperature and Distance from host star. Upon further inspection, this isn’t an accurate conclusion we can make. There are many factors other than Distance from star that impact planet Temperature. A good example of this is Venus: it is further away from the sun than Mercury is, but is hotter due to it’s thick atmosphere of carbon dioxide. Many such planets can exist orbiting other stars. Another consideration to make is the host star itself. Stars vary in temperature, mass and size: any one of those factors can influence planet temperature.

Next, lets see if we can find a correlation between star mass and star metallicity.

e <- exoplanets %>% 
  filter(SemiMajorAxisAU<100000)
  
e %>%  
  
  ggplot(aes(x = HostStarMassSlrMass, y = HostStarMetallicity)) + 
  ggtitle("Star Mass vs Star Metallicity") + 
  xlab("Star Mass (in Stellar Masses)") + 
  ylab("Star Metallicity (in [Fe/H]") + 
  geom_point(aes(color=DiscoveryMethod), size = 1, alpha = 0.5) + 
  geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 255 rows containing non-finite values (stat_smooth).
## Warning: Removed 255 rows containing missing values (geom_point).

From this we can see a slight positive correlation between star mass and star metallicity. By the data we have, more massive stars tend to have higher metallicities. A better correlation to prove would be star age and star metallicity, however it would be difficult to show as stars are at various stages of their lifetimes and even the time period in the universe’s history can have an effect on the metallicity as there were times when the universe was extremely metal poor and some stars from that time period are still around today.

Lets go further and perform a regression analysis on Star Mass vs Star Metallicity:

regstar <- lm(HostStarMetallicity ~ HostStarMassSlrMass, exoplanets)
summary(regstar)
## 
## Call:
## lm(formula = HostStarMetallicity ~ HostStarMassSlrMass, data = exoplanets)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.08183 -0.05976  0.00295  0.08451  0.54777 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.09986    0.01312  -7.611 3.84e-14 ***
## HostStarMassSlrMass  0.11462    0.01238   9.257  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1899 on 2505 degrees of freedom
##   (1077 observations deleted due to missingness)
## Multiple R-squared:  0.03308,    Adjusted R-squared:  0.03269 
## F-statistic: 85.69 on 1 and 2505 DF,  p-value: < 2.2e-16

From the regression model, we have two variables: intercept and HostStarSlrMass. Regressing Metallicity on Star Mass gave us an estimate of -0.09986 for the intercept and 0.11462 for HostStarSlrMass. The standard errors for both are <0.05 indicating that the data is closely distributed around the mean. The median is quite close to 0 while min is around 4 times the magnitude of the max. The adjusted R squared value is 0.03269 which suggests that even though previous indicators showed a trend, the vast majority of the data points do not necessarily fall on or close to the trend line. The p-value (2.2e-16) is very close to 0 which shows that the result is of major practical importance. Since the p-value is less than 0.05, we can reject the null hypothesis.

a <- filter(exoplanets, HostStarMassSlrMass < 1)
summarize(a, Average = mean(HostStarMetallicity, na.rm = T))
## # A tibble: 1 x 1
##   Average
##     <dbl>
## 1 -0.0329
b <- filter(exoplanets, HostStarMassSlrMass > 1)
summarize(b, Average = mean(HostStarMetallicity, na.rm = T))
## # A tibble: 1 x 1
##   Average
##     <dbl>
## 1  0.0688

And as we can see above, the average Star masses fall in-line with the positive correlation as seen on the graph.

Likewise, we can also perform regression for Star Mass vs Planet Mass

regmass <- lm(HostStarMassSlrMass ~ PlanetaryMassJpt, exoplanets)
summary(regmass)
## 
## Call:
## lm(formula = HostStarMassSlrMass ~ PlanetaryMassJpt, data = exoplanets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1428 -0.1928 -0.0134  0.1691  3.3832 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.039482   0.012847  80.912  < 2e-16 ***
## PlanetaryMassJpt 0.005526   0.001212   4.558 5.67e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4359 on 1239 degrees of freedom
##   (2343 observations deleted due to missingness)
## Multiple R-squared:  0.01649,    Adjusted R-squared:  0.0157 
## F-statistic: 20.78 on 1 and 1239 DF,  p-value: 5.673e-06

In this model, the standard errors are lower which means that the sample is closely distributed around the mean, however the r squared value is lower which implies that the linear model is not if at all better than the model using the mean.

Summary: I chose the exoplanet dataset for this project. This dataset lists all known exoplanets (as of June 8th, 2017) and their characteristics such as mass, surface temperature, eccentricity, as well as characteristics about their host stars such as temperature and metallicity. The data is from kaggle.com, a website where many other datasets are available.

It’s important to note that while this dataset is useful to find correlations between planets and their host stars, this dataset is fundamentally biased in that our technology is just not advanced enough to detect certain types of planets orbiting certain stars. Naturally, the dataset is biased towards higher mass planets orbiting lower mass stars since those planets are easier to detect. We can see this in the visualization. In the Planet Temperature vs Distance from Star graph, we find that the majority of planets discovered are by the transit method and those planets are very close to their stars (<0.5AU) in tight orbits. In reality there would likely be many more smaller planets orbiting stars that are just too faint or dark, or get lost in the glare of it’s host star. As our technology gets better and we get better at detecting exoplanets, we will get a better view of the correlations that we made for this dataset.