library(tidyverse)
library(ggrepel)
library(ggthemes)
library(effsize)
library(GGally)
library(ggplot2)
library(xts)
library(tsibble)
library(dplyr)
library(lindia)
library(broom)
Load library
NASA Exoplanet Archive
Load NASA data
nasa_data <- read_delim("C:/Users/imaya/Downloads/cleaned_5250.csv", delim = ",")
head(nasa_data)
# A tibble: 6 × 13
  name     distance stellar_magnitude planet_type discovery_year mass_multiplier
  <chr>       <dbl>             <dbl> <chr>                <dbl>           <dbl>
1 11 Coma…      304              4.72 Gas Giant             2007           19.4
2 11 Ursa…      409              5.01 Gas Giant             2009           14.7
3 14 Andr…      246              5.23 Gas Giant             2008            4.8
4 14 Herc…       58              6.62 Gas Giant             2002            8.14
5 16 Cygn…       69              6.22 Gas Giant             1996            1.78
6 17 Scor…      408              5.23 Gas Giant             2020            4.32
# ℹ 7 more variables: mass_wrt <chr>, radius_multiplier <dbl>,
#   radius_wrt <chr>, orbital_radius <dbl>, orbital_period <dbl>,
#   eccentricity <dbl>, detection_method <chr>
Data Summary
The dataset used in this project is a processed version of data from the NASA Exoplanet Archive. While the original data was collected through various astronomical observations, this version was reorganized by a third-party contributor on Kaggle to improve clarity and usability for analysis.
Each row in the dataset represents a confirmed exoplanet, while each column contains specific information about that planet’s physical and orbital characteristics. The dataset contains approximately 5,250 entries and 13 variables, including distance from Earth, mass, radius, orbital radius, orbital period, eccentricity, and detection method. The dataset includes multiple planet types, such as gas giants, super-Earths, Neptune-like planets, terrestrial planets, and unknown classifications, allowing for comparisons across different categories. The dataset also includes planets discovered across a range of years, enabling analysis of how detection methods and discoveries have evolved over time. The cleaned and structured format improves data organization, making it easier to identify patterns and relationships between variables.
Variable Description
- name: The name or designation of the exoplanet.
- distance: The distance from Earth to the star system where the planet is located.
- stellar_magnitude: The brightness of the host star as seen from Earth; lower values indicate brighter stars.
- planet_type: The classification of the planet (e.g., Gas Giant, Terrestrial).
- discovery_year: The year the planet was discovered.
- mass_multiplier: The mass of the planet, expressed as a multiple of the reference planet named in mass_wrt.
- mass_wrt: The reference planet used for the mass comparison (Jupiter or Earth).
- radius_multiplier: The radius of the planet, expressed as a multiple of the reference planet named in radius_wrt.
- radius_wrt: The reference planet used for the radius comparison (Jupiter or Earth).
- orbital_radius: The average distance between the planet and its host star.
- orbital_period: The time it takes for the planet to complete one orbit around its star.
- eccentricity: A measure of how elliptical the planet’s orbit is, where 0 is a perfect circle and values closer to 1 indicate more elongated orbits.
- detection_method: The technique used to discover the planet (e.g., Radial Velocity, Direct Imaging, Eclipse Timing Variations).
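To make the eccentricity scale more concrete, the sketch below converts an eccentricity e into the ratio of an ellipse's minor axis to its major axis, b/a = sqrt(1 - e^2). The helper name axis_ratio and the example values are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: ratio of minor to major axis for an orbit
# with eccentricity e (0 <= e < 1), using b/a = sqrt(1 - e^2).
axis_ratio <- function(e) {
  stopifnot(all(e >= 0 & e < 1))
  sqrt(1 - e^2)
}

# Illustrative eccentricities, not taken from the dataset
ecc <- c(0, 0.1, 0.5, 0.9)
data.frame(eccentricity = ecc, axis_ratio = round(axis_ratio(ecc), 3))
```

Even at e = 0.5 the ellipse is only about 13% narrower than it is long, so the low eccentricities seen for most planets correspond to orbits that are very close to circular.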
Audience
This analysis is intended for space researchers and individuals interested in astronomy and exoplanet systems. The purpose of this project is to explore similarities and differences between exoplanet systems using observational data.
The audience may include data science teams at space research organizations or educational astronomy platforms. These users are assumed to have a basic understanding of astronomy and data analysis, but are primarily interested in identifying patterns in exoplanet systems rather than detailed astrophysical theory.
Project Information
The International Astronomical Union developed a standardized system for naming exoplanets in order to keep track of the rapidly growing number of discoveries. As detection methods continue to improve, the number of known planets has increased significantly, making a consistent naming convention essential.
Under this system, planets are named after their host star, followed by a lowercase letter that indicates the order in which the planet was discovered. For example, the first planet discovered in a system is labeled b, the second c, and so on. This naming structure allows astronomers to easily identify planets that belong to the same system.
The image above shows the exoplanet HIP 65426 b, captured by the James Webb Space Telescope. It serves as a clear example of how exoplanets are named in relation to their host star. In this case, the star is called HIP 65426, and the planet orbiting it is labeled HIP 65426 b.
Image from: https://www.esa.int/ESA_Multimedia/Images/2022/08/Webb_takes_its_first_exoplanet_image
Exploration Question: How do planets within the same system compare in orbital and physical properties?
Project Goals
The goal of this project is to analyze how planets that are closer to their host star compare in their physical and orbital properties, and to determine whether any observed differences can be explained by planetary formation processes.
Current Limitation
A key limitation of this dataset is the uncertainty regarding completeness within individual planetary systems. It is not guaranteed that all planets orbiting a given host star are included, which may introduce observational bias and affect comparisons between planets in the same system. Additionally, detection methods are more likely to identify larger or closer planets, meaning smaller or more distant planets may be underrepresented.
Exploratory Data Analysis
This code creates a new column called system_id by using str_replace() to remove the final lowercase letter (such as b or c) from each planet’s name. This groups planets from the same star system under a single ID, allowing them to be analyzed together.
nasa_data <- nasa_data |>
  mutate(system_id = str_replace(name, " [b-z]$", ""))
This code groups the data by system_id and counts the number of planets in each system. It then sorts the systems in descending order to identify those with the most planets. Because several systems are tied at six planets, we focus on the first four systems in this ranking to analyze how planets within the same system compare in orbital and physical properties.
nasa_data |>
  group_by(system_id) |>
  summarise(num_planets = n()) |>
  arrange(desc(num_planets))
# A tibble: 3,951 × 2
   system_id  num_planets
   <chr>            <int>
 1 TRAPPIST-1           7
 2 HD 10180             6
 3 HD 191939            6
 4 HD 219134            6
 5 HD 34445             6
 6 K2-138               6
 7 Kepler-11            6
 8 Kepler-20            6
 9 Kepler-80            6
10 TOI-1136             6
# ℹ 3,941 more rows
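As a quick check of the suffix-stripping pattern used to build system_id, the sketch below applies the same regular expression to a handful of names using base R's sub(), which behaves like str_replace() for a single match. The name "Example-42" is hypothetical and is included only to show what happens when a name has no trailing planet letter.

```r
# Illustrative names; the last one is hypothetical and has no planet letter
planet_names <- c("TRAPPIST-1 b", "TRAPPIST-1 c", "HD 10180 h",
                  "Kepler-11 g", "Example-42")

# Base-R equivalent of str_replace(name, " [b-z]$", "")
system_id <- sub(" [b-z]$", "", planet_names)
system_id

# Count planets per system, largest first
sort(table(system_id), decreasing = TRUE)
```

Names that do not end in a space followed by a single lowercase letter are left unchanged, so such a planet would be counted as its own one-planet system.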
top_systems <- nasa_data |>
  filter(system_id %in% c("TRAPPIST-1", "HD 10180", "HD 191939", "HD 219134"))
top_systems |>
  group_by(system_id) |>
  summarise(num_planets = n()) |>
  ggplot(aes(x = reorder(system_id, num_planets), y = num_planets)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Number of Planets per System",
       x = "System",
       y = "Planet Count")
The bar graph shows the top four systems and the number of exoplanets within each system. This highlights differences in planetary abundance across systems.
Exploring discovery year and detection method, while not physical or orbital properties, provides useful context for understanding how exoplanets are identified across different systems and may help explain patterns observed in the data.
top_systems |>
  group_by(system_id) |>
  summarise(
    min_year = min(discovery_year, na.rm = TRUE),
    max_year = max(discovery_year, na.rm = TRUE),
    span_years = max_year - min_year + 1
  ) |>
  arrange(desc(span_years))
# A tibble: 4 × 4
  system_id  min_year max_year span_years
  <chr>         <dbl>    <dbl>      <dbl>
1 HD 191939      2020     2022          3
2 TRAPPIST-1     2016     2017          2
3 HD 10180       2010     2010          1
4 HD 219134      2015     2015          1
The data shows that most exoplanets within each system were discovered within a short time span of one another. For example, systems such as HD 191939 and TRAPPIST-1 have discovery spans of only a few years. This suggests that once an initial exoplanet is detected within a system, it often leads to further discoveries in that same system over a relatively short period.
This pattern may be due to increased observational focus after the first discovery, as astronomers continue to study the same star using similar detection methods. It could also indicate that certain systems are more suitable for detection techniques, making it easier to identify multiple planets once one has already been found.
top_systems |>
  group_by(system_id) |>
  summarise(
    num_methods = n_distinct(detection_method)
  ) |>
  arrange(desc(num_methods))
# A tibble: 4 × 2
  system_id  num_methods
  <chr>            <int>
1 HD 191939            2
2 HD 10180             1
3 HD 219134            1
4 TRAPPIST-1           1
top_systems |>
  group_by(system_id, detection_method) |>
  summarise(count = n()) |>
  arrange(system_id, desc(count))
# A tibble: 5 × 3
# Groups:   system_id [4]
  system_id  detection_method count
  <chr>      <chr>            <int>
1 HD 10180   Radial Velocity      6
2 HD 191939  Transit              4
3 HD 191939  Radial Velocity      2
4 HD 219134  Radial Velocity      6
5 TRAPPIST-1 Transit              7
The method exploration investigates the different detection techniques used to discover exoplanets within each system. The results show that most systems rely on a small number of detection methods, with some systems using only one method and others using two. This suggests that once a detection method is effective for a system, it is often used repeatedly to identify additional planets.
The second table compares detection methods within each system and shows how many planets were discovered using each method. For example, some systems have most of their planets detected using Radial Velocity or Transit methods. This highlights that certain detection techniques are more dominant within specific systems.
However, this also suggests a limitation in the data, as some systems may not have been explored using multiple detection methods. As a result, there may be undiscovered planets that could potentially be identified if alternative detection techniques were applied.
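The method counts reported above can also be laid out as a system-by-method cross-table, which makes the dominance of a single technique per system easier to read. This is a minimal base-R sketch that re-enters the five counts by hand; the object names method_counts and method_table are illustrative.

```r
# Counts copied from the detection-method table above
method_counts <- data.frame(
  system_id = c("HD 10180", "HD 191939", "HD 191939", "HD 219134", "TRAPPIST-1"),
  detection_method = c("Radial Velocity", "Transit", "Radial Velocity",
                       "Radial Velocity", "Transit"),
  count = c(6, 4, 2, 6, 7)
)

# Wide table: one row per system, one column per method
method_table <- xtabs(count ~ system_id + detection_method, data = method_counts)
method_table

# Share of each system's planets found by each method
round(prop.table(method_table, margin = 1), 2)
```

Only HD 191939 has a non-zero entry in both columns, which matches the num_methods summary.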
ggplot(top_systems, aes(x = system_id, fill = planet_type)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c(
    "Gas Giant" = "#A6CEE3",
    "Terrestrial" = "#FDBF6F",
    "Neptune-like" = "#B2DF8A",
    "Super Earth" = "#CAB2D6",
    "Unknown" = "#D9D9D9"
  )) +
  labs(title = "Planet Type Distribution by System",
       x = "System",
       y = "Proportion")
This plot shows the distribution of planet types across different systems. The results suggest that similar planet types may be found within the same system, indicating that planetary composition could be influenced by the conditions present during system formation.
For example, HD 10180 contains a mix of Neptune-like planets and gas giants, which are larger exoplanets. In contrast, TRAPPIST-1 is made up of smaller planets such as terrestrial planets and super-Earths, with no gas giants present. This difference highlights how systems can vary significantly in their planetary structure and composition.
ggplot(top_systems, aes(x = system_id, y = eccentricity)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Orbital Eccentricity by System",
       x = "System",
       y = "Eccentricity")
The box plot shows the distribution of eccentricity for each system. Overall, two of the four systems are internally consistent while the other two are not. TRAPPIST-1 and HD 191939 both have very narrow eccentricity ranges. The nearly flat boxes in the plot tell us that the orbits in these systems are nearly identical and close to circular. In contrast, the two other systems, HD 219134 and HD 10180, have a wider eccentricity range. Their boxes are much taller, which shows a wider range of orbital behaviors among the planets within those systems.
This may indicate that eccentricity is influenced by the overall properties and formation conditions of a system. For example, gravitational interactions between planets can alter their orbits over time and change their eccentricity. Additionally, the protoplanetary disk, a rotating disk of gas and dust surrounding a young star where planets form, can also influence the structure of a system and how circular or eccentric planetary orbits become.
Analysis and Support
Within-System Variation
This code converts planet mass and radius values into a common unit (Earth equivalents) to allow for direct comparison between exoplanets. Since some values are given relative to Jupiter and others relative to Earth, standardising them ensures consistency across the dataset.
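As a sanity check of this conversion logic, the sketch below applies the same rule to a few hand-made rows in base R. The helper to_earth_units and the toy rows are hypothetical; the factors 317.8 and 11.21 are the Jupiter-to-Earth mass and radius ratios used in this analysis.

```r
# Conversion factors used in this analysis
JUPITER_MASS_IN_EARTHS <- 317.8
JUPITER_RADIUS_IN_EARTHS <- 11.21

# Hypothetical helper: convert a multiplier to Earth units given its reference body
to_earth_units <- function(multiplier, wrt, jupiter_factor) {
  ifelse(wrt == "Jupiter", multiplier * jupiter_factor,
         ifelse(wrt == "Earth", multiplier, NA_real_))
}

# Hand-made rows for illustration only; "Saturn" is a deliberately unsupported value
toy <- data.frame(
  mass_multiplier = c(1, 2.5, 0.5),
  mass_wrt = c("Jupiter", "Earth", "Saturn")
)
toy$mass_est_earth <- to_earth_units(toy$mass_multiplier, toy$mass_wrt,
                                     JUPITER_MASS_IN_EARTHS)
toy
```

Rows with an unrecognised reference body come back as NA, so they are excluded by the na.rm = TRUE summaries rather than being silently mixed in with converted values.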
top_systems <- top_systems |>
  mutate(
    mass_est_earth = case_when(
      mass_wrt == "Jupiter" ~ mass_multiplier * 317.8,
      mass_wrt == "Earth" ~ mass_multiplier,
      TRUE ~ NA_real_
    ),
    radius_est_earth = case_when(
      radius_wrt == "Jupiter" ~ radius_multiplier * 11.21,
      radius_wrt == "Earth" ~ radius_multiplier,
      TRUE ~ NA_real_
    )
  )
Exoplanet Systems Mass
This code compares the distribution of planet masses within each system by calculating the mean, standard deviation, minimum, maximum, and overall range of mass values.
top_systems |>
  group_by(system_id) |>
  summarise(
    mean_mass = mean(mass_est_earth, na.rm = TRUE),
    sd_mass = sd(mass_est_earth, na.rm = TRUE),
    min_mass = min(mass_est_earth, na.rm = TRUE),
    max_mass = max(mass_est_earth, na.rm = TRUE),
    mass_range = max_mass - min_mass,
    .groups = "drop"
  )
# A tibble: 4 × 6
  system_id  mean_mass sd_mass min_mass max_mass mass_range
  <chr>          <dbl>   <dbl>    <dbl>    <dbl>      <dbl>
1 HD 10180      26.6    19.4     11.8      64.5       52.8
2 HD 191939    134.    261.       2.8     660.       657.
3 HD 219134     25.2    40.8      4.36    108.       104.
4 TRAPPIST-1     0.921   0.451    0.326     1.37       1.05
The results indicate that there are clear differences in mass variation between systems. For example, TRAPPIST-1 has a very small mass range and low standard deviation, suggesting its planets are relatively similar in mass. In contrast, systems such as HD 191939 show a much larger range and higher standard deviation, indicating a wide variation in planet masses within the system.
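Because the systems differ so much in mean mass, raw standard deviations are hard to compare directly. One optional way to put them on a common scale, not part of the original analysis, is the coefficient of variation (standard deviation divided by the mean). The sketch below computes it from the rounded values printed in the mass summary, so the results are approximate.

```r
# Means and standard deviations copied (rounded) from the mass summary above
mass_summary <- data.frame(
  system_id = c("HD 10180", "HD 191939", "HD 219134", "TRAPPIST-1"),
  mean_mass = c(26.6, 134, 25.2, 0.921),
  sd_mass   = c(19.4, 261, 40.8, 0.451)
)

# Coefficient of variation: spread relative to the system's mean mass
mass_summary$cv_mass <- round(mass_summary$sd_mass / mass_summary$mean_mass, 2)
mass_summary[order(mass_summary$cv_mass), ]
```

On this relative scale TRAPPIST-1 is still the most uniform system and HD 191939 the most varied, so the ordering is not simply a consequence of TRAPPIST-1's planets being small.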
Exoplanet Systems Radius
This code compares the distribution of planet radius within each system by calculating the mean, standard deviation, minimum, maximum, and overall range of radius values.
top_systems |>
  group_by(system_id) |>
  summarise(
    mean_radius = mean(radius_est_earth, na.rm = TRUE),
    sd_radius = sd(radius_est_earth, na.rm = TRUE),
    min_radius = min(radius_est_earth, na.rm = TRUE),
    max_radius = max(radius_est_earth, na.rm = TRUE),
    radius_range = max_radius - min_radius
  )
# A tibble: 4 × 6
  system_id  mean_radius sd_radius min_radius max_radius radius_range
  <chr>            <dbl>     <dbl>      <dbl>      <dbl>        <dbl>
1 HD 10180         5.35      2.14       3.45        9.39        5.94
2 HD 191939        6.61      5.09       2.99       13.3        10.3
3 HD 219134        3.68      4.51       1.31       12.8        11.5
4 TRAPPIST-1       0.979     0.158      0.755       1.13        0.374
The results show differences in radius variation between systems. TRAPPIST-1 has a very low mean radius and a very small standard deviation and range, indicating that its planets are consistently small and very similar in size. In contrast, systems such as HD 219134 and HD 191939 show much larger ranges and higher standard deviations, suggesting a wider variation in planet sizes within those systems.
These differences in planetary radius may be influenced by variations in planet type and mass within each system. For example, gas giants tend to have much larger radii compared to terrestrial planets, and higher-mass planets are generally associated with larger sizes. As a result, systems containing more gas giants are likely to show greater variation in radius, while systems dominated by smaller, rocky planets tend to be more uniform.
top_systems |>
  ggplot(aes(x = system_id, y = mass_est_earth)) +
  geom_boxplot() +
  labs(title = "Mass Variation Within Each Planetary System")
top_systems |>
  ggplot(aes(x = system_id, y = radius_est_earth)) +
  geom_boxplot() +
  labs(title = "Radius Variation Within Each Planetary System")
ggplot(top_systems, aes(x = mass_est_earth, y = radius_est_earth, color = system_id)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship Between Mass and Radius by System",
       x = "Mass (Earth units)",
       y = "Radius (Earth units)")
The data demonstrates a positive correlation between mass and radius across all observed planetary systems; as a planet’s mass increases, its radius generally increases as well. However, the strength of this relationship varies between systems, as shown by differences in the slope of the trend. This suggests that while the mass-radius relationship is a general pattern, the specific composition and density of planets may be influenced by the conditions within their individual planetary systems.
This analysis supports the research question by showing that there is a consistent positive relationship between mass and radius across different systems. This suggests that, regardless of the system, more massive planets tend to have larger radii, indicating a general physical relationship between these properties.
However, the variation in the strength of this relationship between systems shows that planetary characteristics are not identical across systems. Instead, differences in composition, density, and formation conditions may cause planets within and between systems to behave differently.
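Mapping color to system_id in the scatter plot makes geom_smooth(method = "lm") fit one straight line per system, which is why the slopes can differ. The sketch below reproduces that per-group fit in base R on made-up values, so the numbers are purely illustrative and say nothing about the real systems.

```r
# Hypothetical mass and radius values (Earth units) for two made-up systems
toy <- data.frame(
  system_id = rep(c("System A", "System B"), each = 4),
  mass   = c(1, 2, 3, 4, 10, 20, 30, 40),
  radius = c(1.0, 1.2, 1.4, 1.6, 3, 4, 5, 6)
)

# Fit radius ~ mass separately within each system and keep the slope
slopes <- sapply(split(toy, toy$system_id), function(d) {
  unname(coef(lm(radius ~ mass, data = d))["mass"])
})
slopes
```

Applying the same split-and-fit pattern to top_systems, with mass_est_earth and radius_est_earth, would give the slope behind each colored line in the plot.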
Regression Models
Linear Regression of Orbital Radius on Estimated Earth Radius
model_orbitt1 <- lm(orbital_radius ~ radius_est_earth, data = top_systems)
summary(model_orbitt1)
Call:
lm(formula = orbital_radius ~ radius_est_earth, data = top_systems)

Residuals:
     Min       1Q   Median       3Q      Max
-2.07765 -0.24328  0.03922  0.15236  1.65670

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      -0.25450    0.18686  -1.362    0.186
radius_est_earth  0.21065    0.03353   6.282 2.07e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6456 on 23 degrees of freedom
Multiple R-squared:  0.6318,  Adjusted R-squared:  0.6158
F-statistic: 39.46 on 1 and 23 DF,  p-value: 2.074e-06
A linear regression model was used to investigate whether larger planets tend to be further from their star by examining the relationship between planet radius and orbital radius.
The results show a statistically significant positive relationship between planet radius and orbital radius (p < 0.001). The R² value of 0.63 indicates that planet radius accounts for about 63% of the variation in orbital radius among the 25 planets in these four systems. This suggests that larger planets in this sample are generally found at greater distances from their star.
Relating this back to the research question, the findings suggest that there are consistent relationships between physical and orbital properties within planetary systems, where planet size is associated with orbital distance. This indicates that while planets within systems may differ in size, their physical properties still follow observable patterns.
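The headline numbers in the regression summary are linked by simple identities, which gives a quick way to confirm the output was read correctly. With a single predictor, the slope's t value is the estimate divided by its standard error, F equals t squared, and R² can be recovered as F / (F + residual df). The sketch below applies these identities to values copied from the printed summary, so small rounding differences are expected.

```r
# Values copied from the summary(model_orbitt1) output above
estimate  <- 0.21065
std_error <- 0.03353
resid_df  <- 23

t_value   <- estimate / std_error            # t statistic for the slope
f_value   <- t_value^2                       # with one predictor, F = t^2
r_squared <- f_value / (f_value + resid_df)  # R^2 recovered from F and residual df
n_obs     <- resid_df + 2                    # intercept and slope use two df

round(c(t = t_value, F = f_value, R2 = r_squared, n = n_obs), 3)
```

The recovered values agree with the printed summary to rounding, and the implied sample size of 25 matches the planet counts of the four systems (6 + 6 + 6 + 7).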
Normal Q-Q Plot
gg_qqplot(model_orbitt1)
The Normal Q-Q plot shows that most points closely follow the reference line, with a slight deviation at the upper end. This suggests that the residuals are approximately normally distributed, providing reasonable support for the assumptions of the linear regression model. As a result, the observed positive relationship between exoplanet radius and orbital radius can be considered reasonably reliable within this dataset.
Cook’s D Plot
gg_cooksd(model_orbitt1, threshold = 'matlab')
top_systems$name[c(6, 10)]
[1] "HD 10180 h"  "HD 191939 e"
The Cook’s Distance plot identified HD 10180 h and HD 191939 e as highly influential observations. These gas giant planets exert substantial leverage on the regression model because of their much larger radii and orbital distances relative to the smaller terrestrial planets in the sample.
The elevated Cook’s Distance values, particularly for HD 10180 h, suggest that the positive relationship observed between planet radius and orbital radius is strongly influenced by these extreme observations rather than representing a consistent trend across all planet types.
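To illustrate how a single distant point can drive a fitted slope, the sketch below computes Cook's distance with base R's cooks.distance() on made-up data. It assumes that the 'matlab' threshold corresponds to three times the mean Cook's distance; that reading should be confirmed against the lindia documentation. The data are hypothetical and unrelated to the exoplanet sample.

```r
# Hypothetical data: nine points near the line y = x plus one distant, off-trend point
d <- data.frame(
  x = c(1:9, 30),
  y = c(1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2, 9.1, 10)
)
fit <- lm(y ~ x, data = d)

cooks_d <- cooks.distance(fit)
cutoff  <- 3 * mean(cooks_d)   # assumed meaning of threshold = 'matlab'

# Indices of observations above the cutoff
flagged <- unname(which(cooks_d > cutoff))
flagged

# Slope with all points versus slope after dropping the flagged ones
keep <- setdiff(seq_len(nrow(d)), flagged)
slope_all  <- unname(coef(fit)["x"])
slope_kept <- unname(coef(lm(y ~ x, data = d[keep, ]))["x"])
c(all_points = slope_all, without_flagged = slope_kept)
```

Refitting model_orbitt1 without the flagged planets in the same way would show how much of the reported slope depends on them.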
Relationship Between Exoplanet Mass and Orbital Period
model_orbitt2 <- lm(orbital_period ~ mass_est_earth, data = top_systems)
summary(model_orbitt2)
Call:
lm(formula = orbital_period ~ mass_est_earth, data = top_systems)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2798 -0.4871 -0.4495 -0.3713  4.9142

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.448782   0.323756   1.386 0.178987
mass_est_earth 0.009875   0.002369   4.168 0.000371 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.529 on 23 degrees of freedom
Multiple R-squared:  0.4302,  Adjusted R-squared:  0.4055
F-statistic: 17.37 on 1 and 23 DF,  p-value: 0.0003708
A linear regression model was used to investigate whether planet mass is related to orbital period across the top systems.
The results show a statistically significant positive relationship between mass and orbital period (p < 0.001). The R² value of 0.43 indicates that planet mass accounts for about 43% of the variation in orbital period, which is a moderate association rather than evidence that mass determines the period.
This relationship may reflect how physical properties are linked to orbital behavior within planetary systems. It connects back to the main research question by showing that while planets within systems vary in their physical characteristics, these properties can also be associated with differences in how they orbit their stars.
Normal Q-Q Plot
gg_qqplot(model_orbitt2)
The Normal Q-Q plot shows that while the majority of observations follow the reference line, there is an upward deviation at the upper end. This suggests a departure from normality in the residuals, indicating a positive skew. This pattern may be driven by a few high-mass planets with exceptionally long orbital periods. While the central data points support the model assumptions reasonably well, these extreme values suggest that the relationship between mass and orbital period may not be strictly linear across all planet types.
Cook’s D Plot
gg_cooksd(model_orbitt2, threshold = 'matlab')
top_systems$name[c(11)]
[1] "HD 191939 f"
The Cook’s Distance plot identifies observation 11 (HD 191939 f) as a highly influential point. This planet has a substantially higher mass and an orbital period of approximately 6 years, which is considerably longer than most other planets in the dataset. Its values lie far from the main cluster of observations, so it exerts high leverage on the regression model.
Association Between Exoplanet Size and Orbital Eccentricity
model_er <- lm(eccentricity ~ radius_est_earth, data = top_systems)
summary(model_er)
Call:
lm(formula = eccentricity ~ radius_est_earth, data = top_systems)

Residuals:
     Min       1Q   Median       3Q      Max
-0.06682 -0.03989 -0.02315  0.01648  0.20479

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      0.048455   0.018805   2.577   0.0169 *
radius_est_earth 0.001377   0.003375   0.408   0.6871
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06497 on 23 degrees of freedom
Multiple R-squared:  0.007183,  Adjusted R-squared:  -0.03598
F-statistic: 0.1664 on 1 and 23 DF,  p-value: 0.6871
A linear regression model was used to investigate whether planet radius is related to orbital eccentricity across the top systems.
The results show no statistically significant relationship between planet radius and eccentricity (p = 0.687). The R² value of 0.007 indicates that planet radius accounts for less than 1% of the variation in eccentricity, so planet size has no meaningful association with how elliptical a planet’s orbit is in this dataset.
Relating this back to the research question, the findings suggest that while planets within systems may vary in physical size, orbital shape is likely influenced by other factors such as gravitational interactions or system formation conditions rather than planet radius.
Normal Q-Q Plot
gg_qqplot(model_er)
The Normal Q-Q plot reveals a noticeable deviation from the reference line, with an upward curvature in the upper tail of the residuals, suggesting slight positive skew. This indicates that the residuals are not perfectly normally distributed and may be influenced by a small number of high-eccentricity observations. It is also important to note that eccentricity is bounded between 0 and 1, which can naturally lead to skewed distributions and departures from normality, particularly when values are clustered toward lower eccentricities.
Cook’s D Plot
gg_cooksd(model_er, threshold = 'matlab')
top_systems$name[c(5, 11)]
[1] "HD 10180 g"  "HD 191939 f"
The Cook’s Distance plot highlights observations 5 and 11 (HD 10180 g and HD 191939 f) as influential points, although they lie at opposite ends of the eccentricity range. HD 10180 g has a moderately high eccentricity of 0.26, indicating a more elliptical orbit compared to most planets in the sample. In contrast, HD 191939 f has an eccentricity of 0, representing a perfectly circular orbit. Their influence suggests that any apparent trend between eccentricity and planet radius is strongly dependent on a small number of extreme values.
Conclusion
In conclusion, the data provide evidence that planetary systems show relationships between the physical and orbital properties of exoplanets. Several variables, including mass, radius, orbital radius, and orbital period, display statistically significant relationships, while others, such as radius and eccentricity, show little to no association.
The results suggest that certain planetary properties are strongly related, particularly physical size and orbital characteristics, while other relationships are weak or not present. This may indicate that exoplanet properties are not entirely independent within systems, but instead are influenced by formation and development within the protoplanetary disk and their position within the system.
These patterns may be linked to differences in planetary formation environments, particularly the protoplanetary disk. As this disk provides the material from which planets form, variations in its composition and structure could help explain the differences in planetary types and physical properties observed across systems.
Presentation link: NASA_Exoplanets.mp4
Slide Deck: https://docs.google.com/presentation/d/1OKTc_5xFqF7kJNxd3vi14KfT6ff2EeipFFHu2Dylvio/edit?usp=sharing