Exoplanets and the Fulton Gap

Author

Robert Gravatt

Artist’s conception of the surface of Proxima Centauri b. The Alpha Centauri AB binary system can be seen in the distance, to the upper right of Proxima, as two white dots.

1. Background Information

Exoplanets are planets orbiting stars beyond our solar system. Since the first confirmed detection in 1992, over 6000 have been discovered, revealing diverse properties and orbital behaviors. A key feature of exoplanet demographics is the small planet radius gap, a locally bimodal distribution in planet sizes with few planets between ~1.5–2.0 Earth radii. This gap divides small rocky super-Earths from larger gaseous sub-Neptunes and is thought to result from atmospheric loss processes such as photoevaporation or core-powered mass loss (Fulton et al. 2017 ).

2. Data Source

The NASA Exoplanet Archive [8] aggregates observational data from space missions like Kepler, K2, and TESS, as well as ground-based radial velocity surveys. The collection currently documents 6,028 confirmed exoplanets and provides 84 variables describing discovery methods, host stars, stellar systems, and planetary characteristics. These data are obtained through photometry (the transit method) and spectroscopy (radial velocity measurements). Biases include:

·       Transit method favors large, close-in planets.

·       Radial velocity method better detects massive planets.

3. Statistical Information

The dataset includes both continuous variables (planet radius, orbital period, stellar temperature, luminosity) and categorical variables (detection method, spectral class). It is observational, not experimental, and subject to correlated parameters and uneven sampling.

4. Variables Defined

The NASA Exoplanet Archive provides the following relevant variables for this project:

·       pl_rade: Planet radius (Earth radii)

·       pl_orbper: Orbital period (days)

·       pl_bmasse: Planet mass (Earth masses)

·       pl_orbecen: Orbital eccentricity (dimensionless, 0 = circular orbit)

·       st_teff: Stellar effective temperature (Kelvin)

·       st_spectype: Stellar spectral type (categorical, e.g., G2V, M3V)

·       discoverymethod: Detection method (Transit, Radial Velocity, etc.)

·       sy_snum: Number of stars in the system (e.g., 1 for single star, 2 for binary, etc.)

·       sy_pnum: Number of planets in the system

5. Background Research

Fulton et al. (2017) identified the small planet radius gap using Kepler Space Telescope data, showing a scarcity of planets between 1.6–2.0 Earth radii. Petigura et al. (2018) demonstrated that hot Jupiters are more common around metal-rich stars, supporting theories of giant planet formation tied to heavy elements. The NASA Exoplanet Archive serves as the central repository for exoplanet data, currently listing thousands of confirmed planets with standardized parameters (“NASA Exoplanet Archive”). The TESS mission, launched in 2018, expanded the search for exoplanets by surveying bright nearby stars, adding hundreds of small planets to the catalog and refining our understanding of planetary demographics (Ricker et al. 2015).

Key Facts:

1.     The radius gap spans ~1.6–2.0 Earth radii. [6]

2.     Super-Earths and sub-Neptunes dominate small planet populations. [10]

3.     Hot Jupiters preferentially orbit metal-rich stars. [7, 11]

4.     The NASA Exoplanet Archive contains over 6000 confirmed planets. [8]

5.     TESS broadened exoplanet discovery to bright nearby stars. [9, 12]

6. Research Question

How do detection method, stellar properties, and system architecture influence the observed distribution of exoplanet radii, particularly the small planets radius gap?

Load NASA Exoplanet Archive

library(tidyverse)

exoplanets <- read_csv("Exoplanets.csv")

Plot the Radii of Known Exoplanets

# Clean dataset
exoplanets_clean <- exoplanets |>
  filter(!is.na(pl_rade))

# Refined histogram
exoplanets_clean |>
  ggplot(aes(x = pl_rade)) +
  geom_histogram(bins = 50, fill = "forestgreen", color = "white") +
  geom_vline(xintercept = c(1.6, 2), linetype = "dashed", color = "blue") +
  scale_x_log10() +
  labs(
    title = "Distribution of Exoplanet Radii",
    x = "Planet Radius (Earth radii, log scale)",
    y = "Count"
  ) +
  
  annotate(
    "text",
    x = 1.3, y = -50,   
    label = "Fulton gap",
    color = "blue",
    hjust = 0
  ) +
  theme_minimal()

Plot Interpretation

This histogram displays the distribution of exoplanet radii on a logarithmic scale, revealing a clear bimodal pattern. The first peak occurs around 2-3 Earth radii, corresponding to rocky planets and super-Earths, while the second peak appears near 15 Earth radii, representing gas giants like Jupiter and Saturn. The dip between these peaks reflects the relative scarcity of intermediate-sized planets. Once a planetary core reaches a critical mass (~10 Earth masses), runaway gas accretion occurs, producing giant planets. Overall, the plot highlights two dominant populations in the exoplanet census: small terrestrial worlds and large gaseous planets.

[1] “Small planet radius gap.” Wikipedia, Wikimedia Foundation, 13 Dec. 2025, https://en.wikipedia.org/wiki/Small_planet_radius_gap.

Exoplanet Radii Faceted by Discovery Method

# Clean dataset
exoplanets_clean <- exoplanets |>
  filter(!is.na(pl_rade), !is.na(discoverymethod))

# a custom palette of 11 vivid colors
vivid_colors <- c(
  "forestgreen",
  "cyan",
  "royalblue",
  "darkorchid",
  "goldenrod",
  "dodgerblue",
  "darkorange",
  "deeppink",
  "steelblue",
  "mediumseagreen",
  "brown"
)

# Histogram faceted by discovery method with vivid colors
exoplanets_clean |>
  ggplot(aes(x = pl_rade, fill = discoverymethod)) +
  geom_histogram(bins = 40) +
  scale_x_log10() +
  scale_fill_manual(values = vivid_colors) +
  labs(
    title = "Distribution of Exoplanet Radii by Discovery Method",
    x = "Planet Radius (Earth radii, log scale)",
    y = "Count"
  ) +
  facet_wrap(~discoverymethod, scales = "free_y") +
  theme_minimal() +
  theme(
    legend.position = "none",              # remove legend
    strip.text = element_text(size = 7),   # small facet titles
    plot.title = element_text(size = 12, face = "bold")
  )

Interpretation of Plots

This faceted histogram reveals how different exoplanet discovery methods are biased toward detecting planets of specific sizes. Each panel shows the distribution of planet radii for a particular method, plotted on a logarithmic scale to accommodate the wide range of sizes—from Earth-like rocky planets to massive gas giants.

Transit and radial velocity methods dominate the dataset and show broad distributions. Transit detections peak around 1–3 Earth radii, reflecting their sensitivity to small planets that periodically block starlight. Radial velocity methods, on the other hand, favor larger planets (5–20 Earth radii), since massive planets induce stronger stellar wobbles that are easier to measure. Transit is best for detecting Earth-sized planets due to its sensitivity to small dips in starlight, while radial velocity excels at finding gas giants by measuring the strong stellar wobbles they induce.

Microlensing and direct imaging methods show distinct patterns. Microlensing tends to detect planets at intermediate radii, often several Earth radii, due to its reliance on gravitational lensing events. Direct imaging skews heavily toward very large planets because only massive, widely separated planets are bright enough to be resolved against their host stars.

Other methods like astrometry and eclipse timing show narrower distributions, often with fewer detections, reflecting their more specialized or less mature status. Overall, the plot highlights how observational technique shapes our view of planetary populations, with each method contributing a unique slice of the exoplanet census.

[2] Methods of Detecting Exoplanets. Wikipedia, Wikimedia Foundation, 14 Nov. 2025, en.wikipedia.org/wiki/Methods_of_detecting_exoplanets..

Exoplanet Radii by Host Star Luminosity

# Clean dataset: keep planets with radius and spectral type, drop NA luminosity classes
exoplanets_clean <- exoplanets |>
  filter(!is.na(pl_rade), !is.na(st_spectype)) |>
  mutate(
    lum_class = str_extract(st_spectype, "(I{1,3}|II|V)")   # extract I, II, III, V
  ) |>
  filter(!is.na(lum_class))   # drop NA classes

# Histogram of planet radius faceted by luminosity class
exoplanets_clean |>
  ggplot(aes(x = pl_rade, fill = lum_class)) +
  geom_histogram(bins = 40, alpha = 0.7) +
  scale_x_log10() +
  scale_fill_manual(values = c("I" = "firebrick",
                               "II" = "goldenrod",
                               "III" = "forestgreen",
                               "V" = "royalblue")) +
  labs(
    title = "Exoplanet Radius Distribution by Host Luminosity Class",
    x = "Planet Radius (Earth radii, log scale)",
    y = "Count",
    fill = "Luminosity Class"
  ) +
  facet_wrap(~lum_class, scales = "free_y") +
  theme_minimal() +
  theme(
    legend.position = "none",                 # remove legend
    strip.text = element_text(size = 9),
    plot.title = element_text(size = 12, face = "bold")
  )

Interpretation

This faceted histogram shows how the distribution of exoplanet sizes varies depending on the luminosity class of their host stars. Planets around main sequence stars (V) are the most numerous, with a broad spread from Earth‑sized worlds up to gas giants, reflecting the fact that most surveys target Sun‑like stars. Systems around giant stars (III) tend to host larger planets, since close‑in small planets may be engulfed as the star expands. Supergiants (I) and bright giants (II) appear less frequently in the dataset, but when present they are more often associated with massive planets, consistent with observational biases and stellar evolution effects. Overall, the plot highlights that stellar type strongly influences both the detectability and survival of planets, with dwarfs yielding the richest diversity and giants skewing toward larger companions.

[3] Planet-hosting star. Wikipedia, Wikimedia Foundation, 14 Nov. 2025, en.wikipedia.org/wiki/Planet-hosting_star..

Histogram of Number of Planets per Stellar System

ggplot(exoplanets, aes(x = sy_pnum)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Number of Planets per System",
       x = "Planets in System", y = "Count") +
  theme_minimal()

Interpretation

This bar chart shows that most known exoplanet systems contain only a single detected planet, with progressively fewer systems hosting multiple planets.

Host Star Temperature by Year of Discovery

exoplanets |>
  group_by(disc_year) |>
  summarize(mean_teff = mean(st_teff, na.rm = TRUE)) |>
  ggplot(aes(x = disc_year, y = mean_teff)) +
  geom_line(color = "purple") +
  geom_vline(xintercept = 2009, linetype = "dashed", color = "blue") +
  annotate("text", x = 2009, y = 4000,   # lower position
           label = "Kepler Space Telescope            May 2009",
           angle = 90, vjust = -0.5, hjust = 0, color = "blue") +
  labs(title = "Average Host Star Temperature by Year of Discovery",
       x = "Discovery Year", y = "Mean Temperature (K)") +
  theme_minimal()

Plot Interpretation

By 2010, the Kepler Space Telescope had just begun operations (launched in 2009). Kepler was designed to monitor over 150,000 stars, most of them F‑ and G‑type stars with effective temperatures in the range of 6000–7000 K. These stars were chosen because they are bright enough for precise photometry and similar to the Sun, making them ideal targets for detecting Earth‑sized planets via the transit method.

The spike in 2010 reflects the first wave of Kepler discoveries, which dramatically increased the number of known exoplanets. Unlike earlier radial velocity surveys that favored nearby cooler stars, Kepler’s wide‑field approach focused on hotter, Sun‑like stars in a distant patch of the Milky Way. This explains why the NASA dataset shows an average host star temperature of 6500 K in 2010.

[4] Kepler Space Telescope. Wikipedia, Wikimedia Foundation, 14 Nov. 2025, en.wikipedia.org/wiki/Kepler_space_telescope.

Diagram of Exoplanet Radius versus Orbital Period for Binary Star Systems

library(scales)
my_colors <- c(
  "steelblue",        # steel blue
  "orange",      # vivid orange
  "red",         # bright red
  "grey",        # grey
  "maroon",      # maroon
  "green",       # strong green
  "gold",        # yellow/gold
  "black"        # black
)

binary_planets <- exoplanets |> filter(sy_snum == 2)  # filter for binary star systems


ggplot(binary_planets, aes(x = pl_orbper, y = pl_rade, color = discoverymethod)) +
  geom_point(alpha = 0.6) +
  scale_x_log10(labels = label_comma()) +
  scale_y_log10(labels = label_comma()) +
  scale_color_manual(values = my_colors) +
  labs(
    title = "Exoplanet Radius vs Orbital Period (Binary Systems)",
    x = "Orbital Period (days, log scale)",
    y = "Planet Radius (Earth radii, log scale)",
    color = "Discovery Method"
  ) +
  theme_minimal()

Interpretation

Exoplanet discoveries in binary star systems reveal how stellar multiplicity influences planet formation and detection. When plotting planet radius against orbital period for these systems, the distribution shows that many planets orbiting binaries tend to be larger and often detected through radial velocity or transit methods, reflecting observational biases and the dynamical challenges of detecting smaller, longer‑period planets in multi‑star environments. This pattern highlights both the resilience of planet formation in complex gravitational settings and the limitations of current detection techniques, offering insight into how planetary systems differ when more than one star is present.

[5] “Circumbinary Planet.” Wikipedia, Wikimedia Foundation, 17 Nov. 2025, en.wikipedia.org/wiki/Circumbinary_planet..

Plot of Orbital Periods of Known Exoplanets

# --- Histogram of Orbital Periods without scientific notation

library(scales)  # for label_number

exoplanets |>
  filter(!is.na(pl_orbper)) |>
  ggplot(aes(x = pl_orbper)) +
  geom_histogram(bins = 50, fill = "dodgerblue", color = "white") +
  scale_x_log10(labels = label_number()) +   # removes scientific notation
  labs(
    title = "Distribution of Exoplanet Orbital Periods",
    x = "Orbital Period (days, log scale)",
    y = "Count"
  ) +
  theme_minimal()

Description of Distributions

Planet Radius (pl_rade): The distribution of planet radii is bimodal, reflecting two dominant classes of exoplanets. The first peak corresponds to smaller, rocky planets with radii near Earth’s size, while the second peak represents larger gas and ice giants, such as Jupiter‑like planets. This bimodality highlights the natural division between terrestrial and giant planets in the dataset.

Orbital Period (pl_orbper): The distribution of orbital periods is right-skewed. This is due to a selection bias as planets with longer orbital periods are more difficult to detect because they are further away from their host stars.

Statistical Analysis

1. Categorical Variables

Two categorical variables of the dataset are the discover method discoverymethod and the host star spectral type st_spectype.

The discovery method records the technique used to detect each exoplanet, such as Transit, Radial Velocity, Microlensing, or Imaging.

The star spectral type indicates the spectral type of the host star (O, B, A, F, G, K, M), which classifies stars by their temperature, color, and luminosity.

Bargraph of Discovery Methods

# Bargraph of Discovery Methods 
exoplanets |>
  filter(!is.na(discoverymethod)) |>
  ggplot(aes(x = discoverymethod, fill = discoverymethod)) +
  geom_bar(color = "black") +
  labs(
    title = "Exoplanets by Discovery Method",
    x = "Discovery Method",
    y = "Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = FALSE)

Bargraph of Host Star Spectral Types

# --- Simplify Spectral Types to Main Classes ---
exoplanets |>
  filter(!is.na(st_spectype)) |>
  mutate(st_class = str_sub(st_spectype, 1, 1)) |>   # take first letter only
  ggplot(aes(x = st_class, fill = st_class)) +
  geom_bar(color = "black") +
  labs(
    title = "Exoplanets by Host Star Spectral Class",
    x = "Spectral Class",
    y = "Count"
  ) +
  theme_minimal() +
  guides(fill = FALSE)

Description of Categorical Distributions

Discovery Method shows how exoplanets have been detected, and the bargraph reveals that the majority were discovered using the Transit method, followed by Radial Velocity. Transit photometry measures the dip in starlight when a planet passes in front of its star, while radial velocity detects the subtle wobble of a star caused by an orbiting planet. These two methods dominate because they are the most efficient and widely applied techniques, especially in large surveys such as Kepler and TESS. Other approaches, including Microlensing and Direct Imaging, account for far fewer discoveries due to their technical challenges and narrower observational reach. This distribution reflects both the strengths of transit and radial velocity methods and the observational biases that shape our current catalog of exoplanets.

Host Star Spectral Type provides insight into the kinds of stars that exoplanets orbit. When simplified to the main spectral classes, the bargraph shows that most exoplanets are found around F, G, K, and M stars. These letters correspond to stars of decreasing temperature: F stars are hotter and brighter than the Sun, G stars are Sun‑like and yellow in color, K stars are cooler orange stars, and M stars are the coolest and most common red dwarfs. Very few exoplanets are found around massive O or B stars, which are rare and short‑lived. This distribution reflects both astrophysical realities—most stars in the galaxy are K and M types—and observational biases, since cooler, smaller stars make planetary signals easier to detect.

Sampling: The dataset more than 6000 observations (thousands of confirmed exoplanets). We take a random sample of size 500 and name it samp.

set.seed(666)  # for reproducibility
samp <- exoplanets_clean |> sample_n(500)

2. Chi Squared Test

Variables:

  • Categorical 1: discoverymethod

  • Categorical 2: st_spectype (simplified to first letter: O, B, A, F, G, K, M)

Hypotheses:

  • Ho: Discovery method and host star spectral type are independent.

  • Ha: Discovery method and host star spectral type are associated (not independent).

# Simplify spectral type
samp <- samp |> mutate(st_class = str_sub(st_spectype, 1, 1))

# Chi-square test
chisq_test <- chisq.test(table(samp$discoverymethod, samp$st_class))
chisq_test

    Pearson's Chi-squared test

data:  table(samp$discoverymethod, samp$st_class)
X-squared = 265.5, df = 24, p-value < 2.2e-16

Chi-squared = 265.5 , p-value = 0

Conclusion:

Because the p‑value is effectively 0, we reject the null hypothesis of independence between discovery method and stellar spectral class. This association makes sense scientifically. For example, the transit method is more successful with smaller, dimmer stars, while radial velocity techniques are favored for brighter stars with stable spectra.

3. Odds Ratio and Relative Risk

To make the calculation meaningful, we need to reduce the categories into two groups for each variable. For example:

  • Discovery Method: Transit vs. Radial Velocity

  • Host Star Spectral Type: M stars vs. non‑M stars

Question: Are planets around M‑type stars more likely to be discovered by Transit compared to Radial Velocity?

# Collapse to M vs Non-M
samp <- samp |> mutate(M_star = ifelse(st_class == "M", "M", "Non-M"))

# Keep only Transit and Radial Velocity
samp2 <- samp |> filter(discoverymethod %in% c("Transit", "Radial Velocity"))

# Build 2x2 table
tab2 <- table(samp2$discoverymethod, samp2$M_star)
tab2
                 
                    M Non-M
  Radial Velocity  36   225
  Transit          71   145

Odds Ratio

Odds of Transit vs. Radial Velocity for M stars: Odds_M = 71/ 36 = 1.97

Odds of Transit vs. Radial Velocity for Non‑M stars: Odds_NonM = 145 / 225 = 0.64

Odds Ratio: OR = 1.97 / 0.64 = 3.07

Interpretation: Planets around M‑type stars are about three times more likely to be discovered via Transit compared to Radial Velocity, relative to planets around Non‑M stars.

Relative Risk

Risk of Transit among M stars: 71 / (71 + 36) = 71 / 107 = 0.664

Risk of Transit among Non‑M stars: 145 / (145 + 225) = 145 / 370 = 0.392

Relative Risk: RR = 0.664 / 0.392 = 1.69

Interpretation: Planets around M‑type stars are about 69% more likely to be discovered via Transit compared to Non‑M stars.

Conclusion

The analysis shows a strong association between host star type and discovery method: planets around M‑type stars are significantly more likely to be detected using the Transit method than the Radial Velocity method. Both the odds ratio (≈3.07) and relative risk (≈1.69) confirm that Transit discoveries occur disproportionately in M‑star systems compared to non‑M stars.

4. Kruskal-Wallis Nonparametric Test

Two Variables

Categorical variable: : Spectral type (st_class) — groups such as A, F, G, K, M stars

Quantitative variable: pl_rad

This pairing makes sense because planet radius is continuous, and host star spectral type is categorical.

Null hypothesis (Ho): The distribution of planet radii is the same across host stars of spectral types F, G, K, and M. In other words, there is no difference in the central tendency of planet radii among these groups.

Alternative hypothesis (Ha): At least one spectral type group (F, G, K, or M) has a different distribution of planet radii compared to the others.

Check Basic Assumptions

Independence: Each exoplanet is an independent observation.

Normality: Planet radii in each group should be approximately normal. We know that planet radii are bimodal.

Equal variances: Variability in planet radii should be similar across groups.

In this case the basic assumptions for a parametric t-test or ANOVA fail.

Because of these violations, the correct approach is to use a non‑parametric test:

For multiple groups (e.g., spectral types), use the Kruskal–Wallis test.

# Check sample sizes by spectral type
table(samp$st_class)

  A   B   F   G   K   m   M 
  6   3  54 177 147   1 112 

There are too few observations of types A, B and m to be meaningful. We will consider only types F, G, K, M

## Remove type A, B and m stars
samp_noA <- samp |>
  filter(st_class %in% c("F","G","K","M"))


## Visualize distributions (assumptions check)
ggplot(samp_noA, aes(x = st_class, y = pl_rade)) +
  geom_boxplot(fill = "lightblue") +
  scale_y_log10() +
  labs(title = "Planet Radius by Host Star Spectral Type",
       x = "Spectral Type",
       y = "Planet Radius (Earth radii, log10 scale)") +
  theme_minimal()

As can be seen from the above boxplots, the basic assumption for ANOVA fail. We must run a Kruskal-Wallis test.

kruskal.test(pl_rade ~ st_class, data = samp_noA)

    Kruskal-Wallis rank sum test

data:  pl_rade by st_class
Kruskal-Wallis chi-squared = 126.33, df = 3, p-value < 2.2e-16

The Kruskal–Wallis test compared planet radii across the four host star spectral types (F, G, K, M). The test statistic was χ2=126.33, df = 3, p‑value < 2.2e‑16.

Result: Because the p‑value is far below 0.05, we reject the null hypothesis.

Conclusion: Planet radius distributions differ significantly among at least one pair of spectral types.

Next step: A post hoc test (Dunn’s test) is needed to identify which spectral types differ from each other.

# Post hoc Dunn's test
# Using the dunn.test package for pairwise comparisons
library(dunn.test)

dunn.test(samp_noA$pl_rade, samp_noA$st_class, method = "bonferroni") 
  Kruskal-Wallis rank sum test

data: x and group
Kruskal-Wallis chi-squared = 126.3327, df = 3, p-value = 0

                           Comparison of x by group                            
                                 (Bonferroni)                                  
Col Mean-|
Row Mean |          F          G          K
---------+---------------------------------
       G |   0.872406
         |     1.0000
         |
       K |   2.682570   2.609922
         |    0.0219*     0.0272
         |
       M |   8.385412   10.38258   7.672801
         |    0.0000*    0.0000*    0.0000*

alpha = 0.05
Reject Ho if p <= alpha/2

F vs G: No difference (p = 1.000).

F vs K: Significant difference (p = 0.0219 < 0.025).

F vs M: Highly significant difference (p < 0.0001).

G vs K: No difference (p = 0.0272 > 0.025).

G vs M: Highly significant difference (p < 0.0001).

K vs M: Highly significant difference (p < 0.0001).

Conclusion

The Kruskal–Wallis test indicated highly significant differences in planet radius across spectral types (χ² = 126.33, df = 3, p < 0.0000001). A post hoc Dunn’s Test revealed that planets orbiting M‑type stars have significantly different radii compared to those around F, G, and K stars, underscoring the distinct nature of M‑star systems. Additional differences were observed between F and K stars, while F and G stars as well as G and K stars showed no meaningful difference.

5. Multiple Linear Regression

Full Model

# log10 planet radius to mitigate right skew
samp_noA$log_rade <- log10(samp_noA$pl_rade)

# Full model: start broad with plausible host-star predictors
full_mod <- lm(log_rade ~ st_class + st_mass + st_rad + st_met + sy_dist,
               data = samp_noA)

summary(full_mod)                   # p-values, adjusted R^2

Call:
lm(formula = log_rade ~ st_class + st_mass + st_rad + st_met + 
    sy_dist, data = samp_noA)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.95198 -0.21302  0.06471  0.22240  0.90507 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.241e-01  7.114e-02  11.583  < 2e-16 ***
st_classG   -6.084e-02  5.176e-02  -1.175  0.24048    
st_classK   -1.627e-01  5.617e-02  -2.896  0.00396 ** 
st_classM   -4.888e-01  6.586e-02  -7.422 5.76e-13 ***
st_mass      6.968e-02  4.617e-02   1.509  0.13197    
st_rad       8.198e-03  2.886e-03   2.841  0.00470 ** 
st_met       4.523e-01  7.190e-02   6.291 7.46e-10 ***
sy_dist      4.535e-05  5.772e-05   0.786  0.43249    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3145 on 453 degrees of freedom
  (29 observations deleted due to missingness)
Multiple R-squared:  0.3855,    Adjusted R-squared:  0.376 
F-statistic:  40.6 on 7 and 453 DF,  p-value: < 2.2e-16
summary(full_mod)$adj.r.squared     # adjusted R^2
[1] 0.3759976

Backward Elimination

best_mod <- stats::step(full_mod, direction = "backward")
Start:  AIC=-1058.7
log_rade ~ st_class + st_mass + st_rad + st_met + sy_dist

           Df Sum of Sq    RSS     AIC
- sy_dist   1    0.0610 44.859 -1060.1
<none>                  44.798 -1058.7
- st_mass   1    0.2252 45.023 -1058.4
- st_rad    1    0.7982 45.596 -1052.6
- st_met    1    3.9133 48.711 -1022.1
- st_class  3    8.5704 53.369  -984.0

Step:  AIC=-1060.07
log_rade ~ st_class + st_mass + st_rad + st_met

           Df Sum of Sq    RSS      AIC
<none>                  44.859 -1060.07
- st_mass   1    0.2590 45.118 -1059.42
- st_rad    1    0.8105 45.670 -1053.82
- st_met    1    3.9511 48.810 -1023.16
- st_class  3    8.6118 53.471  -985.11
summary(best_mod)

Call:
lm(formula = log_rade ~ st_class + st_mass + st_rad + st_met, 
    data = samp_noA)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.91862 -0.21869  0.06138  0.22111  0.90068 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.833466   0.070101  11.889  < 2e-16 ***
st_classG   -0.067467   0.051049  -1.322  0.18696    
st_classK   -0.170524   0.055253  -3.086  0.00215 ** 
st_classM   -0.495023   0.065347  -7.575 2.03e-13 ***
st_mass      0.074153   0.045803   1.619  0.10615    
st_rad       0.008258   0.002883   2.864  0.00438 ** 
st_met       0.454203   0.071827   6.324 6.13e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3143 on 454 degrees of freedom
  (29 observations deleted due to missingness)
Multiple R-squared:  0.3847,    Adjusted R-squared:  0.3765 
F-statistic:  47.3 on 6 and 454 DF,  p-value: < 2.2e-16

Regression equation (from backward elimination):

log10(Planet Radius) = 0.8241 – 0.627 × (star class = K) – 0.4888 × (star class = M) + 0.0082 × (stellar radius) + 0.4523 × (stellar metallicity)

Intercept = 0.8241

Coefficient for K stars = −0.1627 (relative to baseline F stars)

Coefficient for M stars = −0.4888 (relative to baseline F stars)

Coefficient for stellar radius = +0.0082

Coefficient for stellar metallicity = +0.4523

par(mfrow = c(2, 2))
plot(best_mod)

par(mfrow = c(1, 1))

Addressing the basic assumptions of linear regression:

Linearity: The residuals versus fitted values plot showed no strong curvature, which supports the assumption that the relationship between predictors and the response is linear.

Independence: Each planetary system is treated as an independent observation, so the independence assumption is reasonable.

Constant variance: The scale-location plot showed a slight upward trend, suggesting mild heteroscedasticity, meaning the variance of residuals may increase with fitted values.

Normality of residuals: The Q-Q plot indicated that residuals were approximately normally distributed, with only minor deviations at the tails.

Influential points: The residuals versus leverage plot revealed a few points near or beyond the distance threshold, suggesting some influence, but nothing extreme.

Conclusions

The final regression equation was: log10(Planet Radius) = 0.8241 – 0.627 × (star class = K) – 0.4888 × (star class = M) + 0.0082 × (stellar radius) + 0.4523 × (stellar metallicity)

P‑values: The categorical predictors for star class K and M were statistically significant, indicating that planets around these stars tend to be smaller than those around F stars. The coefficient for G stars was not significant, suggesting no meaningful difference from F stars. Stellar radius and stellar metallicity were also significant, both with positive coefficients, meaning larger and more metal-rich stars tend to host larger planets. Stellar mass and system distance were not statistically significant, with p-values of 0.13 and 0.43 respectively, indicating weak or negligible effects.

Adjusted R²: The adjusted R² was approximately 0.376, meaning the model explains about 38% of the variation in log-transformed planet radius. This represents a moderate level of explanatory power but is still low by physical science standards.

Diagnostic plots: The residuals versus fitted values plot showed no strong curvature, supporting the assumption of linearity. The Q-Q plot indicated that residuals were approximately normal, with only minor deviations at the tails. The scale-location plot suggested mild heteroscedasticity, with variance increasing slightly at higher fitted values. The residuals versus leverage plot revealed a few points near the Cook’s distance threshold, suggesting some influence but no extreme outliers. Overall, the assumptions of linear regression were reasonably satisfied.

Summary: The regression model identifies spectral type, stellar radius, and stellar metallicity as significant predictors of planet radius, while stellar mass and system distance were not significant. The model explains a moderate proportion of the variance, and diagnostic checks confirm that the assumptions of linear regression are largely met, with only minor concerns about heteroscedasticity and influential observations.

Comparison to Krukal-Wallis Test: The nonparametric analysis above showed that planet radius differs significantly across stellar spectral classes. The Kruskal–Wallis test confirmed overall differences, and the Dunn post hoc comparisons indicated that planets around K and especially M stars are significantly smaller than those around F stars, while G stars did not differ from F stars. The regression model is consistent with these findings: the coefficients for K and M classes were negative and statistically significant, showing the same pattern of smaller planets relative to the F-star baseline. The regression analysis extended these results by identifying stellar radius and metallicity as significant continuous predictors, both with positive effects on planet radius, while stellar mass and system distance were not significant. The adjusted R² of 0.376 indicates that the model explains a moderate proportion of the variance, and diagnostic plots showed that the assumptions of linear regression were reasonably met, with only mild heteroscedasticity and a few potentially influential points. Taken together, the Kruskal–Wallis and Dunn tests established the importance of spectral class differences, and the regression model confirmed those differences while adding further explanatory variables and quantifying their effects.

6. Test of the Fulton Gap of Small Exoplanet Radii - before and after 2017

Histograms of the Data

exoplanets |>
  filter(!is.na(pl_rade)) |>
  mutate(period = ifelse(disc_year < 2017, "Discovered before 2017", "Discovered 2017 or later")) %>%
  ggplot(aes(x = pl_rade, fill = period)) +
  geom_histogram(bins = 50, color = "white") +
  scale_x_log10() +
  facet_wrap(~ period, ncol = 2) +
  labs(
    title = "Distribution of Exoplanet Radii by Discovery Era",
    x = "Planet Radius (Earth radii, log10 scale)",
    y = "Count"
  ) +
  scale_fill_manual(values = c("Discovered before 2017" = "steelblue",
                               "Discovered 2017 or later" = "darkorange")) +
  theme_minimal() +
  theme(legend.position = "none")

Histograms Showing the Fulton Gap (using 1.6 - 2.0 Earth radii)

exoplanets |>
  filter(!is.na(pl_rade), !is.na(disc_year)) |>
  mutate(period = ifelse(disc_year < 2017, "Discovered before 2017", "Discovered 2017 or later")) |>
  ggplot(aes(x = pl_rade)) +
  geom_histogram(bins = 60, fill = "grey70", color = "white") +
  geom_density(aes(y = ..scaled.. * 40), color = "red", size = 0.2) +

  scale_x_log10() +
  facet_wrap(~ period, ncol = 2) +
  geom_vline(xintercept = 1.6, linetype = "dashed", color = "blue") +
  geom_vline(xintercept = 2.0, linetype = "dashed", color = "blue") +
  labs(
    title = "Exoplanet Radius Distributions (Testing Fulton Gap)",
    x = "Planet Radius (Earth radii, log scale)",
    y = "Count (scaled with density)",
    caption = "Red line = smoothed distribution density curve"
  ) +
   annotate(
    "text",
    x = 1.1, y = -50,   
    label = "Fulton gap",
    color = "blue",
    hjust = 0
  ) +
  theme_minimal()

A Closer Look

ggplot(
  subset(exoplanets, pl_rade >= 1.3 & pl_rade <= 2.3),
  aes(x = pl_rade)
) +
  geom_histogram(bins = 10, fill = "grey70", color = "white") +
  # geom_density(aes(y = ..count../70), color = "red", size = 0.4) +
  geom_vline(xintercept = c(1.6, 2), linetype = "dashed", color = "blue") +
  facet_wrap(~ ifelse(disc_year < 2017, "Discovered before 2017", "Discovered 2017 or later")) +
  labs(
    title = "Exoplanet Radius Distribution (1.3–2.3 Earth radii)",
    x = "Planet Radius (Earth radii)",
    y = "Count"
  ) +
   annotate(
    "text",
    x = 1.7, y = -2,   
    label = "Fulton gap",
    color = "blue",
    hjust = 0
  ) +
  theme_minimal()

Hypothesis Testing:

Null hypothesis (Ho): The proportion of exoplanets in the Fulton gap (1.6–2.0 Earth radii) is the same across discovery eras (pre‑2017 vs post‑2017). In other words, being discovered before or after 2017 does not affect the likelihood that a planet falls in the gap.

Alternative hypothesis (Ha): The proportion of exoplanets in the Fulton gap differs between discovery eras. In other words, the distribution of planets across the gap vs non‑gap categories is not independent of discovery era.

# Create gap indicator and counts
counts <- exoplanets |>
  filter(!is.na(pl_rade), !is.na(disc_year)) |>
  mutate(period = ifelse(disc_year < 2017, "Discovered before 2017", "Discovered 2017 or later"),
         in_gap = pl_rade >= 1.6 & pl_rade <= 2.0) |>
  group_by(period, in_gap) |>
  summarise(n = n(), .groups = "drop") |>
  pivot_wider(names_from = in_gap, values_from = n, values_fill = 0)

counts
# A tibble: 2 × 3
  period                   `FALSE` `TRUE`
  <chr>                      <int>  <int>
1 Discovered 2017 or later    2375    211
2 Discovered before 2017      3020    398
  • Before 2017 → ~11.6% of planets in the gap (398 / (3020+398)).

    2017 or later → ~8.2% in the gap (211 / (2375+211)).

# Chi-square test on gap vs non-gap counts
chisq.test(counts |> select(-period))

    Pearson's Chi-squared test with Yates' continuity correction

data:  select(counts, -period)
X-squared = 19.236, df = 1, p-value = 1.155e-05

Fulton Gap Test Results :

The chi‑square test comparing Fulton gap occupancy across eras was highly significant (χ² = 19.23, df = 1, p ≈ 1.56 × 10⁻⁵), leading us to reject the null hypothesis that the proportion of planets in the gap is unchanged before and after 2017. Planets discovered after 2017 were less likely to fall in the 1.6–2.0 Earth‑radii range (8.2% vs. 11.6%), consistent with a sharpening of the valley. However, this apparent sharpening must be interpreted cautiously. In the later era, the radius distribution shows a plateau within the gap, and the proportions of planets discovered are subject to survey selection effects. Kepler provided the original transit detections, but its stellar radius estimates were often uncertain, which blurred the planet size distribution. With Gaia’s precise parallaxes and improved spectroscopy, stellar sizes became much more reliable, and because planetary radii are derived directly from stellar radii, the Fulton gap emerged with stronger statistical definition. In short, the gap was always present, but its visibility reflects both improved stellar measurements and the biases inherent in different discovery surveys.

Further References

[6] Fulton, Benjamin J., et al. “The California-Kepler Survey. III. A Gap in the Radius Distribution of Small Planets.” The Astronomical Journal, vol. 154, no. 3, 2017, p. 109. doi:10.3847/1538-3881/aa80eb.

[7] Petigura, Erik A., et al. “The California-Kepler Survey. IV. Metal-Rich Stars Host Giant Planets.” The Astronomical Journal, vol. 155, no. 2, 2018, p. 89. doi:10.3847/1538-3881/aaa54d.

[8]  “NASA Exoplanet Archive.” NASA, https://exoplanetarchive.ipac.caltech.edu/. . Accessed 23 Nov. 2025.

[9] Ricker, George R., et al. “Transiting Exoplanet Survey Satellite (TESS).” Journal of Astronomical Telescopes, Instruments, and Systems, vol. 1, no. 1, 2015, p. 014003. DOI: 10.1117/1.JATIS.1.1.014003.

[10] Rogers, L. A. Formation, Structure and Habitability of Super-Earth and Sub-Neptune Exoplanets. PhD Thesis, MIT (2012).

[11] Osborn, A., Bayliss, D. Investigating the planet–metallicity correlation for hot Jupiters. Monthly Notices of the Royal Astronomical Society 491, 4481–4487 (2020), pp. 4481–4487. DOI: 10.1093/mnras/stz3207.

[12] NASA Science Mission Directorate. Transiting Exoplanet Survey Satellite (TESS). NASA Science, 2018 - present