Data Dive Ten

GLMs

Load library

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

Load NASA data

nasa_data <- read_delim("C:/Users/imaya/Downloads/cleaned_5250.csv", delim = ",")
## Rows: 5250 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, planet_type, mass_wrt, radius_wrt, detection_method
## dbl (8): distance, stellar_magnitude, discovery_year, mass_multiplier, radiu...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nasa_data)
## # A tibble: 6 × 13
##   name     distance stellar_magnitude planet_type discovery_year mass_multiplier
##   <chr>       <dbl>             <dbl> <chr>                <dbl>           <dbl>
## 1 11 Coma…      304              4.72 Gas Giant             2007           19.4 
## 2 11 Ursa…      409              5.01 Gas Giant             2009           14.7 
## 3 14 Andr…      246              5.23 Gas Giant             2008            4.8 
## 4 14 Herc…       58              6.62 Gas Giant             2002            8.14
## 5 16 Cygn…       69              6.22 Gas Giant             1996            1.78
## 6 17 Scor…      408              5.23 Gas Giant             2020            4.32
## # ℹ 7 more variables: mass_wrt <chr>, radius_multiplier <dbl>,
## #   radius_wrt <chr>, orbital_radius <dbl>, orbital_period <dbl>,
## #   eccentricity <dbl>, detection_method <chr>

Creating a Binary Variable

nasa_data_clean <- nasa_data |>
  drop_na(eccentricity, stellar_magnitude, planet_type)

top_method <- nasa_data_clean |>
  count(detection_method, sort = TRUE) |>
  slice_head(n = 1)
print(top_method)
## # A tibble: 1 × 2
##   detection_method     n
##   <chr>            <int>
## 1 Transit           3943
nasa_data_clean$Transit <- ifelse(nasa_data_clean$detection_method == "Transit", 1, 0)
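
As a quick check on the coding, the sketch below builds the same 0/1 indicator on a small made-up data frame and cross-tabulates it against the original column. The `toy` rows and the non-Transit method labels are hypothetical and are not taken from the NASA file.

```r
# Hypothetical mini data set; these rows are illustrative, not from cleaned_5250.csv
toy <- data.frame(
  detection_method = c("Transit", "Radial Velocity", "Transit",
                       "Direct Imaging", "Transit"),
  stringsAsFactors = FALSE
)

# Same rule as above: 1 if the method is "Transit", 0 for any other method
toy$Transit <- ifelse(toy$detection_method == "Transit", 1, 0)

# Every "Transit" row should fall in column 1 and all other methods in column 0
check <- table(toy$detection_method, toy$Transit)
print(check)

# Share of planets coded 1
transit_share <- mean(toy$Transit)
print(transit_share)
```

On the real data, the analogous check would be `table(nasa_data_clean$detection_method, nasa_data_clean$Transit)`.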

The detection method was converted into a binary variable indicating whether a planet was discovered using the transit method (1) or another method (0). The transit method was chosen as the outcome of interest (coded 1) because it has identified over 75% of known exoplanets (Exoplanet Detection Methods | The Schools’ Observatory, 2023) and is also the most frequent method in the cleaned data (3,943 planets). This method detects planets when they pass in front of their host star, causing a periodic dip in the observed light. Below is an image from Exoplanet Detection Methods illustrating how the transit method works.

Reference

Exoplanet Detection Methods | The Schools’ Observatory. (2023). Schoolsobservatory.org. https://www.schoolsobservatory.org/learn/space/exoplanets/detection-methods

Logistic Regression Model

I built a logistic regression model to explore how orbital eccentricity, stellar magnitude, and planet type influence the likelihood that an exoplanet is detected using the Transit method.

model <- glm(Transit ~ eccentricity + stellar_magnitude + planet_type,
             data = nasa_data_clean, family = binomial)

summary(model)
## 
## Call:
## glm(formula = Transit ~ eccentricity + stellar_magnitude + planet_type, 
##     family = binomial, data = nasa_data_clean)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -6.47026    0.26618 -24.308  < 2e-16 ***
## eccentricity             -4.33143    0.38449 -11.265  < 2e-16 ***
## stellar_magnitude         0.65887    0.02339  28.169  < 2e-16 ***
## planet_typeNeptune-like   0.79374    0.12711   6.245 4.25e-10 ***
## planet_typeSuper Earth    1.30399    0.15797   8.254  < 2e-16 ***
## planet_typeTerrestrial    2.17998    0.49528   4.402 1.07e-05 ***
## planet_typeUnknown      -17.90729  364.43246  -0.049    0.961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5429.0  on 5088  degrees of freedom
## Residual deviance: 2274.8  on 5082  degrees of freedom
## AIC: 2288.8
## 
## Number of Fisher Scoring iterations: 13
coef(summary(model))
##                            Estimate   Std. Error      z value      Pr(>|z|)
## (Intercept)              -6.4702557   0.26617896 -24.30791618 1.616703e-130
## eccentricity             -4.3314323   0.38449097 -11.26536804  1.945374e-29
## stellar_magnitude         0.6588730   0.02339034  28.16859753 1.418510e-174
## planet_typeNeptune-like   0.7937405   0.12710806   6.24461190  4.248537e-10
## planet_typeSuper Earth    1.3039944   0.15797486   8.25444241  1.526130e-16
## planet_typeTerrestrial    2.1799827   0.49527548   4.40155594  1.074774e-05
## planet_typeUnknown      -17.9072869 364.43246436  -0.04913746  9.608097e-01

The logistic regression shows that stellar magnitude has a positive coefficient (0.659) on the log-odds scale. Stellar magnitude measures the brightness of a star, where higher values mean dimmer stars. The positive coefficient indicates that, holding eccentricity and planet type constant, each 1-unit increase in stellar magnitude raises the log-odds of detection via the Transit method by 0.659, so the probability of Transit detection also increases. This may be due to observational biases or the way Transit detections are measured, rather than a physical effect of dimmer stars.
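
To make the log-odds scale more concrete, the sketch below converts the fitted coefficients into predicted probabilities for a few hypothetical planets. The coefficient values are copied from the summary output; the magnitudes 8 to 14 and the choice of a Gas Giant on a circular orbit are illustrative assumptions, not values taken from the data.

```r
# Coefficients copied from the summary(model) output above
b0    <- -6.47026   # intercept (Gas Giant baseline)
b_ecc <- -4.33143   # eccentricity
b_mag <-  0.65887   # stellar_magnitude

# Hypothetical Gas Giants on circular orbits at several stellar magnitudes
mags <- c(8, 10, 12, 14)   # illustrative values, not taken from the data
ecc  <- 0

# Linear predictor (log-odds), then inverse logit to get a probability
log_odds <- b0 + b_ecc * ecc + b_mag * mags
prob     <- plogis(log_odds)

print(data.frame(stellar_magnitude = mags,
                 log_odds = log_odds,
                 prob_transit = prob))
```

With the fitted model object available, `predict(model, newdata = ..., type = "response")` should give the same kind of result without copying coefficients by hand.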

Eccentricity has a negative coefficient (-4.33). Eccentricity measures how elliptical a planet’s orbit is, with 0 being a perfect circle. The negative coefficient suggests that, with the other predictors held constant, planets with more elliptical orbits are less likely to have been detected via Transit. One possible explanation is that circular orbits increase the chance that a planet crosses directly in front of its star from our line of sight, although the model by itself cannot confirm this.

Planet type is also associated with detection method. The model compares each planet type to Gas Giants, which are the reference category and therefore do not appear with a separate coefficient. Neptune-like, Super Earth, and Terrestrial planets have positive coefficients, meaning they are more likely to be detected via Transit than Gas Giants with the same eccentricity and stellar magnitude. The Unknown category has a very large negative estimate (-17.9) with a standard error of 364 and p = 0.961, so it cannot be distinguished from zero and should not be interpreted. Different planet sizes and compositions may affect how easily they block their star’s light when using the transit detection method.
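
The `planet_typeUnknown` row in the summary has an estimate of -17.9 with a standard error of 364, and the fit needed 13 Fisher scoring iterations. A pattern like this commonly appears when one category contains only 0s or only 1s in the outcome (often called separation). A cross-tabulation is one way to check for it. The sketch below uses a made-up data frame, so the counts are hypothetical and only the technique carries over.

```r
# Hypothetical data: every "Unknown" planet has Transit == 0
toy <- data.frame(
  planet_type = c(rep("Gas Giant", 6), rep("Super Earth", 6), rep("Unknown", 3)),
  Transit     = c(1, 0, 1, 0, 0, 1,   1, 1, 0, 1, 1, 0,   0, 0, 0)
)

# Counts of 0/1 outcomes within each planet type
counts <- table(toy$planet_type, toy$Transit)
print(counts)

# Flag categories where one of the two outcomes never occurs
separated <- rownames(counts)[apply(counts == 0, 1, any)]
print(separated)
```

On the real data, the analogous call would be `table(nasa_data_clean$planet_type, nasa_data_clean$Transit)`.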

exp(coef(model))
##             (Intercept)            eccentricity       stellar_magnitude 
##            1.548830e-03            1.314870e-02            1.932613e+00 
## planet_typeNeptune-like  planet_typeSuper Earth  planet_typeTerrestrial 
##            2.211654e+00            3.683982e+00            8.846154e+00 
##      planet_typeUnknown 
##            1.670952e-08

Exponentiating the logistic regression coefficients gives odds ratios. The exponentiated intercept (0.00155) represents the baseline odds of Transit detection for a Gas Giant (the reference planet type) when eccentricity and stellar magnitude are both zero. Eccentricity has an odds ratio of 0.013 per 1-unit increase. Because a 1-unit increase spans the whole range of values for a closed orbit (0 to just under 1), this indicates that planets with more elliptical orbits have much lower odds of Transit detection. Stellar magnitude has an odds ratio of 1.93, meaning each 1-unit increase in magnitude (a dimmer star) is associated with roughly 93% higher odds of Transit detection; this result is counterintuitive and may reflect dataset bias or variable coding. Planet type is also associated with detection method: Neptune-like, Super Earth, and Terrestrial planets have higher odds of Transit detection than Gas Giants, with odds ratios of about 2.21, 3.68, and 8.85 respectively.
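
Since eccentricity only ranges from 0 to just under 1, an odds ratio for a smaller step can be easier to read than the per-unit value. The sketch below rescales the eccentricity coefficient from the summary output to a 0.1 increase; the step size is an illustrative choice, not something taken from the original analysis.

```r
# Coefficient for eccentricity, copied from the model summary above
b_ecc <- -4.33143

# Odds ratio for a full 1-unit increase (matches exp(coef(model)))
or_per_1 <- exp(b_ecc)

# Odds ratio for a 0.1 increase: scale the log-odds coefficient, then exponentiate
step      <- 0.1                      # illustrative increment
or_per_01 <- exp(b_ecc * step)

# Percent change in the odds of Transit detection for that increment
pct_change <- (or_per_01 - 1) * 100

print(c(or_per_1 = or_per_1, or_per_0.1 = or_per_01, pct_change = pct_change))
```

Under this model, a 0.1 increase in eccentricity corresponds to roughly 35% lower odds of Transit detection, with the other predictors held constant.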

Build Confidence Interval

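# Estimate and standard error for stellar_magnitude (log-odds scale)
coef_stellar <- coef(summary(model))["stellar_magnitude", "Estimate"]
se_stellar   <- coef(summary(model))["stellar_magnitude", "Std. Error"]

# 95% Wald interval on the log-odds scale (1.96 is the normal critical value)
ci_low <- coef_stellar - (1.96 * se_stellar)
ci_up  <- coef_stellar + (1.96 * se_stellar)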

exp(c(ci_low,ci_up))
## [1] 1.846013 2.023276

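The 95% confidence interval for the odds ratio of stellar magnitude is 1.85 to 2.02. This is a Wald interval: it is computed on the log-odds scale as the estimate plus or minus 1.96 standard errors and then exponentiated. We are 95% confident that this interval captures the true odds ratio. For each 1-unit increase in stellar magnitude, the odds of detecting a planet via the transit method increase by approximately 85% to 102%, holding eccentricity and planet type constant. Because the interval lies entirely above 1, the positive association is statistically distinguishable from no effect. However, since higher stellar magnitude corresponds to dimmer stars, this result may reflect dataset bias.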
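
The manual calculation above can be cross-checked against `confint.default()` from base R's stats package, which also computes Wald intervals. The sketch below does this on simulated stand-in data, because the NASA file is not bundled here; the names `sim_model`, `mag`, and `y` and the simulation settings are hypothetical.

```r
set.seed(1)

# Simulated stand-in data (hypothetical; not the NASA exoplanet file)
n   <- 500
mag <- rnorm(n, mean = 12, sd = 2)
y   <- rbinom(n, size = 1, prob = plogis(-6 + 0.5 * mag))
sim_model <- glm(y ~ mag, family = binomial)

# Manual Wald interval on the log-odds scale, as in the calculation above
est <- coef(summary(sim_model))["mag", "Estimate"]
se  <- coef(summary(sim_model))["mag", "Std. Error"]
z   <- qnorm(0.975)                       # about 1.96
manual_ci <- exp(c(est - z * se, est + z * se))

# Same interval from stats::confint.default(), which uses Wald intervals
builtin_ci <- exp(confint.default(sim_model, parm = "mag", level = 0.95))

print(manual_ci)
print(builtin_ci)
```

For the model in this report, `exp(confint.default(model, parm = "stellar_magnitude"))` should closely match the manual result. Calling `confint()` on a glm uses profile likelihood instead, so its limits can differ slightly.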