NASA Exoplanet Archive

Load library

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
library(ggthemes)
library(effsize)
library(GGally)
library(ggplot2)
library(xts)
Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric


######################### Warning from 'xts' package ##########################
#                                                                             #
# The dplyr lag() function breaks how base R's lag() function is supposed to  #
# work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
# source() into this session won't work correctly.                            #
#                                                                             #
# Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
# conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
# dplyr from breaking base R's lag() function.                                #
#                                                                             #
# Code in packages is not affected. It's protected by R's namespace mechanism #
# Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
#                                                                             #
###############################################################################

Attaching package: 'xts'

The following objects are masked from 'package:dplyr':

    first, last
library(tsibble)

Attaching package: 'tsibble'

The following object is masked from 'package:zoo':

    index

The following object is masked from 'package:lubridate':

    interval

The following objects are masked from 'package:base':

    intersect, setdiff, union
library(dplyr)

Presentation link: Recording-20260427_214911.webm

Load NASA data

nasa_data <- read_delim("C:/Users/imaya/Downloads/cleaned_5250.csv",delim = ",")
Rows: 5250 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): name, planet_type, mass_wrt, radius_wrt, detection_method
dbl (8): distance, stellar_magnitude, discovery_year, mass_multiplier, radiu...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nasa_data)
# A tibble: 6 × 13
  name     distance stellar_magnitude planet_type discovery_year mass_multiplier
  <chr>       <dbl>             <dbl> <chr>                <dbl>           <dbl>
1 11 Coma…      304              4.72 Gas Giant             2007           19.4 
2 11 Ursa…      409              5.01 Gas Giant             2009           14.7 
3 14 Andr…      246              5.23 Gas Giant             2008            4.8 
4 14 Herc…       58              6.62 Gas Giant             2002            8.14
5 16 Cygn…       69              6.22 Gas Giant             1996            1.78
6 17 Scor…      408              5.23 Gas Giant             2020            4.32
# ℹ 7 more variables: mass_wrt <chr>, radius_multiplier <dbl>,
#   radius_wrt <chr>, orbital_radius <dbl>, orbital_period <dbl>,
#   eccentricity <dbl>, detection_method <chr>

Data Summary

The dataset used in this project is a processed version of data from the NASA Exoplanet Archive. While the original data was collected through various astronomical observations, this version was reorganized by a third-party contributor on Kaggle to improve clarity and usability for analysis.

Each row in the dataset represents a confirmed exoplanet, and each column records one of that planet’s physical or orbital characteristics. The dataset contains 5,250 entries and 13 variables, including distance from Earth, mass, radius, orbital radius, orbital period, eccentricity, and detection method. It covers multiple planet types, such as gas giants, super-Earths, Neptune-like planets, terrestrial planets, and unknown classifications, which allows comparisons across categories. It also includes planets discovered across a range of years, enabling analysis of how detection methods and discoveries have evolved over time. The cleaned and structured format makes it easier to identify patterns and relationships between variables.

Variable Description

  • name: The name or designation of the exoplanet.
  • distance: The distance from Earth to the star system where the planet is located.
  • stellar_magnitude: The brightness of the host star as seen from Earth; lower values indicate brighter stars.
  • planet_type: The classification of the planet (e.g., Gas Giant, Terrestrial).
  • discovery_year: The year the planet was discovered.
  • mass_multiplier: The mass of the planet relative to a reference planet.
  • mass_wrt: The reference planet used for the mass comparison (Jupiter or Earth).
  • radius_multiplier: The radius of the planet relative to a reference planet.
  • radius_wrt: The reference planet used for the radius comparison (Jupiter or Earth).
  • orbital_radius: The average distance between the planet and its host star.
  • orbital_period: The time it takes for the planet to complete one orbit around its star.
  • eccentricity: A measure of how elliptical the planet’s orbit is, where 0 is a perfect circle and values closer to 1 indicate more elongated orbits.
  • detection_method: The technique used to discover the planet (e.g., Radial Velocity, Direct Imaging, Eclipse Timing Variations).
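
The eccentricity definition in the list above can be made concrete with a short sketch. The helper names eccentricity_from_axes() and orbit_extremes() are illustrative and are not part of the dataset or of any package; they only apply the standard ellipse relations e = sqrt(1 - (b/a)^2), periapsis = a(1 - e), and apoapsis = a(1 + e).

```r
# Hypothetical helpers illustrating what an eccentricity value means.
# a = semi-major axis, b = semi-minor axis (same units, b <= a).
eccentricity_from_axes <- function(a, b) {
  stopifnot(a > 0, b > 0, b <= a)
  sqrt(1 - (b / a)^2)
}

# Closest and farthest distance from the star for an orbit with
# semi-major axis a and eccentricity e (valid for 0 <= e < 1).
orbit_extremes <- function(a, e) {
  stopifnot(a > 0, e >= 0, e < 1)
  c(periapsis = a * (1 - e), apoapsis = a * (1 + e))
}

eccentricity_from_axes(a = 1, b = 1)  # a circle has eccentricity 0
orbit_extremes(a = 1, e = 0.2)        # distance varies between 0.8 and 1.2
```

With e = 0.2, the planet’s distance from its star varies by 20% on either side of the semi-major axis, which gives a sense of scale for the eccentricity values compared later in the analysis.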

Audience

This analysis is intended for space researchers and individuals interested in astronomy and exoplanet systems. The purpose of this project is to explore similarities and differences between exoplanet systems using observational data.

The audience may include data science teams at space research organizations or educational astronomy platforms. These users are assumed to have a basic understanding of astronomy and data analysis, but are primarily interested in identifying patterns in exoplanet systems rather than detailed astrophysical theory.

Project Information

The International Astronomical Union developed a standardized system for naming exoplanets in order to keep track of the rapidly growing number of discoveries. As detection methods continue to improve, the number of known planets has increased significantly, making a consistent naming convention essential.

Under this system, planets are named after their host star, followed by a lowercase letter that indicates the order in which the planet was discovered. For example, the first planet discovered in a system is labeled b, the second c, and so on. This naming structure allows astronomers to easily identify planets that belong to the same system.

The image above shows the exoplanet HIP 65426 b, captured by the James Webb Space Telescope. It is a clear example of how exoplanets are named after their host star: the star is called HIP 65426, and the planet orbiting it is labeled HIP 65426 b.

Image from: https://www.esa.int/ESA_Multimedia/Images/2022/08/Webb_takes_its_first_exoplanet_image

Exploration Question: How do planets within the same system compare in orbital and physical properties?

Project Goals

The goal of this project is to analyze how planets that are closer to their host star compare in their physical and orbital properties, and to determine whether any observed differences can be explained by planetary formation processes.

Current Limitation

A key limitation of this dataset is the uncertainty regarding completeness within individual planetary systems. It is not guaranteed that all planets orbiting a given host star are included, which may introduce observational bias and affect comparisons between planets in the same system. Additionally, detection methods are more likely to identify larger or closer planets, meaning smaller or more distant planets may be underrepresented.

Exploratory Data Analysis

This code creates a new column called system_id by using str_replace() to remove the final lowercase letter (such as b or c), together with the space before it, from each planet’s name. This groups planets from the same star system under a single ID, allowing them to be analyzed together.

nasa_data <- nasa_data |>
  mutate(system_id = str_replace(name, " [b-z]$", ""))
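
As a quick check of what the pattern does, the sketch below applies the same regular expression to a few names. The first three names appear in this report; "Example-1 A" is a made-up name, included only to show that an uppercase suffix is left untouched.

```r
library(stringr)

# Planet names from this report, plus one hypothetical name ("Example-1 A").
planet_names <- c("TRAPPIST-1 b", "HD 10180 c", "HIP 65426 b", "Example-1 A")

# Remove a trailing space followed by one lowercase letter from b to z.
system_ids <- str_replace(planet_names, " [b-z]$", "")
system_ids
```

Names that do not end in a space plus a lowercase letter pass through unchanged, so any planet with an unusual designation would end up as its own one-planet "system". It may be worth scanning system_id for such cases before grouping.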

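This code groups the data by system_id and counts the number of planets in each system, then sorts the systems in descending order to identify those with the most planets. To analyze how planets within the same system compare in orbital and physical properties, we focus on four of the largest systems: TRAPPIST-1 (seven planets) and the first three six-planet systems in the resulting table, HD 10180, HD 191939, and HD 219134. At least six other systems are tied at six planets, so this selection is a convenience sample rather than a unique top four.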

nasa_data |>
  group_by(system_id) |>
  summarise(num_planets = n()) |>
  arrange(desc(num_planets))
# A tibble: 3,951 × 2
   system_id  num_planets
   <chr>            <int>
 1 TRAPPIST-1           7
 2 HD 10180             6
 3 HD 191939            6
 4 HD 219134            6
 5 HD 34445             6
 6 K2-138               6
 7 Kepler-11            6
 8 Kepler-20            6
 9 Kepler-80            6
10 TOI-1136             6
# ℹ 3,941 more rows
top_systems <- nasa_data |>
  filter(system_id %in% c("TRAPPIST-1", "HD 10180", "HD 191939", "HD 219134"))
top_systems |>
  group_by(system_id) |>
  summarise(num_planets = n()) |>
  ggplot(aes(x = reorder(system_id, num_planets), y = num_planets)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Number of Planets per System",
       x = "System",
       y = "Planet Count")

The bar graph shows the number of exoplanets in each of the four selected systems: TRAPPIST-1 has seven planets, and the other three systems have six each.
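
Because several systems are tied at six planets, the way the systems are selected affects which ones are analyzed. The sketch below is not the method used in this report (which hard-codes the four system names); it uses dplyr’s slice_max() on a small table of counts copied from the output above to show how ties behave.

```r
library(dplyr)

# Planet counts copied from the first rows of the table above.
system_counts <- tibble(
  system_id = c("TRAPPIST-1", "HD 10180", "HD 191939", "HD 219134", "HD 34445"),
  num_planets = c(7L, 6L, 6L, 6L, 6L)
)

# with_ties = TRUE (the default) keeps every system tied at the cut-off,
# so asking for 4 rows can return more than 4.
top_with_ties <- system_counts |>
  slice_max(num_planets, n = 4, with_ties = TRUE)

# with_ties = FALSE returns exactly 4 rows.
top_exactly_four <- system_counts |>
  slice_max(num_planets, n = 4, with_ties = FALSE)

nrow(top_with_ties)
nrow(top_exactly_four)
```

Keeping ties would bring every six-planet system into the analysis, while dropping them gives a fixed sample size at the cost of an arbitrary cut among equally large systems.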

Exploring discovery year and detection method, while not physical or orbital properties, provides useful context for understanding how exoplanets are identified across different systems and may help explain patterns observed in the data.

top_systems |>
  group_by(system_id) |>
  summarise(
    min_year = min(discovery_year, na.rm = TRUE),
    max_year = max(discovery_year, na.rm = TRUE),
    span_years = max_year - min_year + 1
  ) |>
  arrange(desc(span_years))
# A tibble: 4 × 4
  system_id  min_year max_year span_years
  <chr>         <dbl>    <dbl>      <dbl>
1 HD 191939      2020     2022          3
2 TRAPPIST-1     2016     2017          2
3 HD 10180       2010     2010          1
4 HD 219134      2015     2015          1

The data shows that the exoplanets in each of these systems were discovered within a few years of one another. HD 10180 and HD 219134 each have a discovery span of a single year, while TRAPPIST-1 and HD 191939 span two and three years respectively. This suggests that once an initial exoplanet is detected within a system, further discoveries in that system often follow over a relatively short period.

This pattern may be due to increased observational focus after the first discovery, as astronomers continue to study the same star using similar detection methods. It could also indicate that certain systems are more suitable for detection techniques, making it easier to identify multiple planets once one has already been found.

top_systems |>
  group_by(system_id) |>
  summarise(num_methods = n_distinct(detection_method)) |>
  arrange(desc(num_methods))
# A tibble: 4 × 2
  system_id  num_methods
  <chr>            <int>
1 HD 191939            2
2 HD 10180             1
3 HD 219134            1
4 TRAPPIST-1           1
top_systems |>
  group_by(system_id, detection_method) |>
  summarise(count = n()) |>
  arrange(system_id, desc(count))
`summarise()` has grouped output by 'system_id'. You can override using the
`.groups` argument.
# A tibble: 5 × 3
# Groups:   system_id [4]
  system_id  detection_method count
  <chr>      <chr>            <int>
1 HD 10180   Radial Velocity      6
2 HD 191939  Transit              4
3 HD 191939  Radial Velocity      2
4 HD 219134  Radial Velocity      6
5 TRAPPIST-1 Transit              7

The method exploration investigates the detection techniques used to discover exoplanets within each system. Each system relies on only one or two detection methods: HD 10180, HD 219134, and TRAPPIST-1 each use a single method, while HD 191939 uses two. This suggests that once a detection method is effective for a system, it is often used repeatedly to identify additional planets.

The second table breaks this down by method and shows how many planets were discovered with each. All six planets in HD 10180 and in HD 219134 were found by Radial Velocity, all seven TRAPPIST-1 planets by Transit, and HD 191939 combines four Transit detections with two Radial Velocity detections. This highlights that a single detection technique tends to dominate within a given system.

However, this also suggests a limitation in the data, as some systems may not have been explored using multiple detection methods. As a result, there may be undiscovered planets that could potentially be identified if alternative detection techniques were applied.

ggplot(top_systems, aes(x = system_id, fill = planet_type)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c(
    "Gas Giant" = "#A6CEE3",
    "Terrestrial" = "#FDBF6F",
    "Neptune-like" = "#B2DF8A",
    "Super Earth" = "#CAB2D6",
    "Unknown" = "#D9D9D9"
  )) +
  labs(title = "Planet Type Distribution by System",
       x = "System",
       y = "Proportion")

This chart shows the proportion of each planet type within each system. The results suggest that similar planet types may be found within the same system, indicating that planetary composition could be influenced by the conditions present during system formation.

For example, HD 10180 contains a mix of Neptune-like planets and gas giants, which are larger exoplanets. In contrast, TRAPPIST-1 is made up of smaller planets such as terrestrial planets and super-Earths, with no gas giants present. This difference highlights how systems can vary significantly in their planetary structure and composition.

ggplot(top_systems, aes(x = system_id, y = eccentricity)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Orbital Eccentricity by System",
       x = "System",
       y = "Eccentricity")

The box plot shows the distribution of eccentricity for each system. Overall, most systems display relatively similar eccentricity levels, with only a few outliers. Interestingly, HD 191939 and TRAPPIST-1 show consistently low eccentricity values across their planets, suggesting more uniform and stable orbital patterns within these systems.

This may indicate that eccentricity is influenced by the overall properties and formation conditions of a system. For example, gravitational interactions between planets can alter their orbits over time and change their eccentricity. Additionally, the protoplanetary disk, a rotating disk of gas and dust surrounding a young star where planets form, can influence the structure of a system and how circular or eccentric planetary orbits become.

Analysis and Support

Within-System Variation

This code converts planet mass and radius values into a common unit (Earth equivalents) to allow for direct comparison between exoplanets. Since some values are given relative to Jupiter and others relative to Earth, standardising them ensures consistency across the dataset.

top_systems <- top_systems |>
  mutate(
    mass_est_earth = case_when(
      mass_wrt == "Jupiter" ~ mass_multiplier * 317.8,
      mass_wrt == "Earth" ~ mass_multiplier,
      TRUE ~ NA_real_
    ),
    
    radius_est_earth = case_when(
      radius_wrt == "Jupiter" ~ radius_multiplier * 11.21,
      radius_wrt == "Earth" ~ radius_multiplier,
      TRUE ~ NA_real_
    )
  )
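
To make the conversion concrete, the sketch below applies the same case_when() logic to two illustrative rows. The planet names and multipliers are hypothetical and chosen only to show the arithmetic; the factor 317.8 is the one used in the code above.

```r
library(dplyr)

# Two hypothetical rows: one mass given relative to Jupiter, one to Earth.
example_planets <- tibble(
  name = c("Example b", "Example c"),
  mass_multiplier = c(0.5, 2.0),
  mass_wrt = c("Jupiter", "Earth")
)

example_planets <- example_planets |>
  mutate(
    mass_est_earth = case_when(
      mass_wrt == "Jupiter" ~ mass_multiplier * 317.8,  # Jupiter masses to Earth masses
      mass_wrt == "Earth" ~ mass_multiplier,            # already in Earth masses
      TRUE ~ NA_real_                                   # unknown reference unit
    )
  )

example_planets$mass_est_earth
```

Half a Jupiter mass becomes 158.9 Earth masses, while the Earth-referenced value is passed through unchanged. Rows with any other reference value become NA, which is why the summaries that follow use na.rm = TRUE.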

Exoplanet Systems Mass

This code compares the distribution of planet masses within each system by calculating the mean, standard deviation, minimum, maximum, and overall range of mass values.

top_systems |>
  group_by(system_id) |>
  summarise(
    mean_mass = mean(mass_est_earth, na.rm = TRUE),
    sd_mass = sd(mass_est_earth, na.rm = TRUE),
    min_mass = min(mass_est_earth, na.rm = TRUE),
    max_mass = max(mass_est_earth, na.rm = TRUE),
    mass_range = max_mass - min_mass,
    .groups = "drop"
  )
# A tibble: 4 × 6
  system_id  mean_mass sd_mass min_mass max_mass mass_range
  <chr>          <dbl>   <dbl>    <dbl>    <dbl>      <dbl>
1 HD 10180      26.6    19.4     11.8      64.5       52.8 
2 HD 191939    134.    261.       2.8     660.       657.  
3 HD 219134     25.2    40.8      4.36    108.       104.  
4 TRAPPIST-1     0.921   0.451    0.326     1.37       1.05

The results indicate that there are clear differences in mass variation between systems. For example, TRAPPIST-1 has a very small mass range and low standard deviation, suggesting its planets are relatively similar in mass. In contrast, HD 191939 shows a much larger range and higher standard deviation, indicating a wide variation in planet masses within the system.
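
Because the systems have very different mean masses, raw standard deviations are hard to compare directly. One option, sketched below as a supplementary check rather than part of the original analysis, is the coefficient of variation (standard deviation divided by mean). The values are copied from the rounded summary table above, so the results are approximate.

```r
# Values copied from the mass summary table above (Earth masses, rounded).
mass_summary <- data.frame(
  system_id = c("HD 10180", "HD 191939", "HD 219134", "TRAPPIST-1"),
  mean_mass = c(26.6, 134, 25.2, 0.921),
  sd_mass   = c(19.4, 261, 40.8, 0.451)
)

# Coefficient of variation: spread relative to the typical planet mass.
mass_summary$cv_mass <- mass_summary$sd_mass / mass_summary$mean_mass

# Sort from most uniform to most varied system.
mass_summary[order(mass_summary$cv_mass), ]
```

Even after scaling by the mean, TRAPPIST-1 remains the most uniform system (a ratio of roughly 0.49) and HD 191939 the most varied (roughly 1.9), which is consistent with the interpretation above.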

Exoplanet Systems Radius

This code compares the distribution of planet radius within each system by calculating the mean, standard deviation, minimum, maximum, and overall range of radius values.

top_systems |>
  group_by(system_id) |>
  summarise(
    mean_radius = mean(radius_est_earth, na.rm = TRUE),
    sd_radius = sd(radius_est_earth, na.rm = TRUE),
    min_radius = min(radius_est_earth, na.rm = TRUE),
    max_radius = max(radius_est_earth, na.rm = TRUE),
    radius_range = max_radius - min_radius
  )
# A tibble: 4 × 6
  system_id  mean_radius sd_radius min_radius max_radius radius_range
  <chr>            <dbl>     <dbl>      <dbl>      <dbl>        <dbl>
1 HD 10180         5.35      2.14       3.45        9.39        5.94 
2 HD 191939        6.61      5.09       2.99       13.3        10.3  
3 HD 219134        3.68      4.51       1.31       12.8        11.5  
4 TRAPPIST-1       0.979     0.158      0.755       1.13        0.374

The results show differences in radius variation between systems. TRAPPIST-1 has a low mean radius (about 0.98 Earth radii) and a very small standard deviation and range, indicating that its planets are consistently small and very similar in size. In contrast, HD 219134 and HD 191939 show much larger ranges and higher standard deviations, suggesting a wider variation in planet sizes within those systems.

These differences in planetary radius may be influenced by variations in planet type and mass within each system. For example, gas giants tend to have much larger radii compared to terrestrial planets, and higher-mass planets are generally associated with larger sizes. As a result, systems containing more gas giants are likely to show greater variation in radius, while systems dominated by smaller, rocky planets tend to be more uniform.

top_systems |>
  ggplot(aes(x = system_id, y = mass_est_earth)) +
  geom_boxplot() +
  labs(title = "Mass Variation Within Each Planetary System")

top_systems |>
  ggplot(aes(x = system_id, y = radius_est_earth)) +
  geom_boxplot() +
  labs(title = "Radius Variation Within Each Planetary System")

ggplot(top_systems, aes(x = mass_est_earth, y = radius_est_earth, color = system_id)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship Between Mass and Radius by System",
       x = "Mass (Earth units)",
       y = "Radius (Earth units)")
`geom_smooth()` using formula = 'y ~ x'

The data demonstrates a positive correlation between mass and radius across all four planetary systems: as a planet’s mass increases, its radius generally increases as well. However, the strength of this relationship varies between systems, as shown by differences in the slope of the trend. This suggests that while the mass-radius relationship is a general pattern, the specific composition and density of planets may be influenced by the conditions within their individual planetary systems.

This analysis supports the research question by showing that there is a consistent positive relationship between mass and radius across different systems. Regardless of the system, more massive planets tend to have larger radii, indicating a general physical relationship between these properties.

However, the variation in the strength of this relationship between systems shows that planetary characteristics are not identical across systems. Instead, differences in composition, density, and formation conditions may cause planets within and between systems to behave differently.

Regression Models

model_orbitt1 <- lm(orbital_radius ~ radius_est_earth, data = top_systems)
summary(model_orbitt1)

Call:
lm(formula = orbital_radius ~ radius_est_earth, data = top_systems)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.07765 -0.24328  0.03922  0.15236  1.65670 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -0.25450    0.18686  -1.362    0.186    
radius_est_earth  0.21065    0.03353   6.282 2.07e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6456 on 23 degrees of freedom
Multiple R-squared:  0.6318,    Adjusted R-squared:  0.6158 
F-statistic: 39.46 on 1 and 23 DF,  p-value: 2.074e-06

A linear regression model was used to investigate whether larger planets tend to be further from their star by examining the relationship between planet radius and orbital radius.

The results show a statistically significant positive relationship between planet radius and orbital radius (p < 0.001). The R² value of 0.63 means that planet radius accounts for about 63% of the variance in orbital radius across the 25 planets in these four systems. This suggests that larger planets in this sample are generally found at greater distances from their star.

Relating this back to the research question, the findings suggest that there are consistent relationships between physical and orbital properties within planetary systems, where planet size is associated with orbital distance. This indicates that while planets within systems may differ in size, their physical properties still follow observable patterns.
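
The figures in the regression summary for model_orbitt1 are internally linked, which gives a simple way to sanity-check them. The sketch below uses only values copied from that output and standard identities for a linear model with an intercept; the variable names are illustrative.

```r
# Values copied from summary(model_orbitt1) above.
f_stat  <- 39.46  # F-statistic
df_num  <- 1      # numerator degrees of freedom
df_den  <- 23     # residual degrees of freedom
t_slope <- 6.282  # t value for radius_est_earth

# For a linear model with an intercept, R^2 = F * df1 / (F * df1 + df2).
r_squared <- (f_stat * df_num) / (f_stat * df_num + df_den)

# With a single predictor, the squared slope t value equals F.
t_squared <- t_slope^2

# Planets used in the fit: residual df plus 2 estimated coefficients.
n_used <- df_den + 2

round(r_squared, 4)
round(t_squared, 2)
n_used
```

The recomputed R² agrees with the reported 0.6318, and the 25 observations match the 7 + 6 + 6 + 6 planets in the four selected systems, so no rows were dropped for missing values in this model.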

model_orbitt2 <- lm(orbital_period ~ mass_est_earth, data = top_systems)
summary(model_orbitt2)

Call:
lm(formula = orbital_period ~ mass_est_earth, data = top_systems)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2798 -0.4871 -0.4495 -0.3713  4.9142 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.448782   0.323756   1.386 0.178987    
mass_est_earth 0.009875   0.002369   4.168 0.000371 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.529 on 23 degrees of freedom
Multiple R-squared:  0.4302,    Adjusted R-squared:  0.4055 
F-statistic: 17.37 on 1 and 23 DF,  p-value: 0.0003708

A linear regression model was used to investigate whether planet mass is related to orbital period across the top systems.

The results show a statistically significant positive relationship between mass and orbital period (p < 0.001). The R² value of 0.43 indicates that planet mass accounts for about 43% of the variance in orbital period, a moderate association. The residuals are strongly right-skewed (median -0.45, maximum 4.91), so a small number of long-period planets may be driving much of the fit.

This relationship may reflect how physical properties are linked to orbital behavior within planetary systems. It connects back to the main research question by showing that while planets within systems vary in their physical characteristics, these properties can also be associated with differences in how they orbit their stars.

model_er <- lm(eccentricity ~ radius_est_earth, data = top_systems)
summary(model_er)

Call:
lm(formula = eccentricity ~ radius_est_earth, data = top_systems)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.06682 -0.03989 -0.02315  0.01648  0.20479 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)  
(Intercept)      0.048455   0.018805   2.577   0.0169 *
radius_est_earth 0.001377   0.003375   0.408   0.6871  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06497 on 23 degrees of freedom
Multiple R-squared:  0.007183,  Adjusted R-squared:  -0.03598 
F-statistic: 0.1664 on 1 and 23 DF,  p-value: 0.6871

A linear regression model was used to investigate whether planet radius is related to orbital eccentricity across the top systems.

The results show no statistically significant relationship between planet radius and eccentricity (p = 0.687). The R² value of 0.007 means planet radius explains less than 1% of the variance in eccentricity, so planet size has no meaningful association with how elliptical a planet’s orbit is in this dataset.

Relating this back to the research question, the findings suggest that while planets within systems may vary in physical size, orbital shape is likely influenced by other factors such as gravitational interactions or system formation conditions rather than planet radius.
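
The negative adjusted R² in the eccentricity model can look surprising, so the sketch below recomputes it from the reported values using the standard adjustment formula. The variable names are illustrative; n_obs is inferred from the 23 residual degrees of freedom plus two coefficients.

```r
# Values copied from summary(model_er) above.
r_squared <- 0.007183  # Multiple R-squared
n_obs     <- 25        # 23 residual df + 2 estimated coefficients
n_pred    <- 1         # one predictor: radius_est_earth

# Adjusted R^2 penalises R^2 for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_pred - 1)

round(adj_r_squared, 5)
```

A value below zero means the model fits no better than simply predicting the mean eccentricity for every planet, which reinforces the conclusion that radius carries essentially no information about eccentricity here.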

Conclusion

In conclusion, the data provide evidence that the physical and orbital properties of exoplanets are related within the four systems studied. Planet radius is significantly associated with orbital radius, and planet mass with orbital period, while radius and eccentricity show little to no association.

These results suggest that certain planetary properties are related, particularly physical size and orbital distance, while other relationships are weak or absent. This indicates that exoplanet properties are not entirely independent within systems and may instead be shaped by formation and development within the protoplanetary disk and by each planet’s position within the system. Because the regressions pool only 25 planets from four systems, these patterns should be treated as exploratory.

These patterns may be linked to differences in planetary formation environments, particularly the protoplanetary disk. As this disk provides the material from which planets form, variations in its composition and structure could help explain the differences in planetary types and physical properties observed across systems.

Overall, while exoplanets vary significantly between systems, the findings highlight consistent relationships between certain physical and orbital properties, suggesting that system-level formation processes play an important role in shaping planetary characteristics.