Data Dive Seven

Hypothesis Testing

Load libraries

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(pwrss)
## Warning: package 'pwrss' was built under R version 4.5.2
## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

Load NASA data

nasa_data <- read_delim("C:/Users/imaya/Downloads/cleaned_5250.csv",delim = ",")
## Rows: 5250 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, planet_type, mass_wrt, radius_wrt, detection_method
## dbl (8): distance, stellar_magnitude, discovery_year, mass_multiplier, radiu...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nasa_data)
## # A tibble: 6 × 13
##   name     distance stellar_magnitude planet_type discovery_year mass_multiplier
##   <chr>       <dbl>             <dbl> <chr>                <dbl>           <dbl>
## 1 11 Coma…      304              4.72 Gas Giant             2007           19.4 
## 2 11 Ursa…      409              5.01 Gas Giant             2009           14.7 
## 3 14 Andr…      246              5.23 Gas Giant             2008            4.8 
## 4 14 Herc…       58              6.62 Gas Giant             2002            8.14
## 5 16 Cygn…       69              6.22 Gas Giant             1996            1.78
## 6 17 Scor…      408              5.23 Gas Giant             2020            4.32
## # ℹ 7 more variables: mass_wrt <chr>, radius_multiplier <dbl>,
## #   radius_wrt <chr>, orbital_radius <dbl>, orbital_period <dbl>,
## #   eccentricity <dbl>, detection_method <chr>

Hypothesis One

Research Question:

Do Neptune-like planets have a larger orbital radius than Super Earth planets?

Variables

Outcome variable (continuous): Orbital radius

Group 1: Neptune-like planets

Group 2: Super Earth planets

Null Hypothesis: There is no difference in the average orbital radius between Neptune-like planets and Super Earth planets.

Alternative Hypothesis: Neptune-like planets have a larger average orbital radius than Super Earth planets.

Significance Level (Alpha): Alpha is set to 0.05.

Power: The target power is set to 0.95, matching the value used in the sample size calculation.

Neyman-Pearson Framework

nasa_data |>
  filter(planet_type %in% c("Neptune-like", "Super Earth")) |>
  filter(orbital_radius > 0) |>
  ggplot() +
  geom_boxplot(
    mapping = aes(
      x = factor(planet_type, levels = c("Super Earth", "Neptune-like")),
      y = orbital_radius
    ),
    notch = TRUE,
    fill = "skyblue",
    outlier.alpha = 0.2
  ) +
  scale_y_log10() +
  labs(
    title = "Orbital Radius of Neptune-like vs Super Earth Planets",
    x = "Planet Type",
    y = "Orbital Radius (AU, Log Scale)"
  ) +
  theme_minimal()

I created this visualization beforehand to explore the data and assess whether the hypothesis was worth investigating. The boxplot shows a clear difference in orbital radius between Neptune-like and Super Earth planets. This visual exploration guided the hypothesis and provided preliminary evidence against the null hypothesis of no difference.

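# Keep the two planet types of interest with a positive, non-missing orbital radius
or_fd <- nasa_data |>
  filter(planet_type %in% c("Neptune-like", "Super Earth")) |>
  filter(orbital_radius > 0) |>
  drop_na(orbital_radius)

# Put Neptune-like first so that alternative = "greater" tests
# mean(Neptune-like) - mean(Super Earth) > 0
or_fd$planet_type <- factor(or_fd$planet_type, levels = c("Neptune-like", "Super Earth"))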


t_test_results <- t.test(
  orbital_radius ~ planet_type, 
  data = or_fd, 
  alternative = "greater", 
  var.equal = FALSE
)

nasa_sd <- sd(or_fd$orbital_radius, na.rm = TRUE)



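# Minimum sample size per group for a one-sided Welch t-test
s_n <- pwrss.t.2means(
  mu1 = 0.5,          # assumed mean difference (mu2 defaults to 0)
  sd1 = nasa_sd,      # overall SD of orbital radius, used for both groups
  kappa = 1,          # equal group sizes (n1 / n2 = 1)
  power = 0.95,
  alpha = 0.05,
  alternative = "greater"
)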
## +--------------------------------------------------+
## |             SAMPLE SIZE CALCULATION              |
## +--------------------------------------------------+
## 
## Welch's T-Test (Independent Samples)
## 
## ---------------------------------------------------
## Hypotheses
## ---------------------------------------------------
##   H0 (Null Claim) : d - null.d <= 0 
##   H1 (Alt. Claim) : d - null.d > 0 
## 
## ---------------------------------------------------
## Results
## ---------------------------------------------------
##   Sample Size            = 13 and 13  <<
##   Type 1 Error (alpha)   = 0.050
##   Type 2 Error (beta)    = 0.037
##   Statistical Power      = 0.963
plot(s_n)

t_test_results
## 
##  Welch Two Sample t-test
## 
## data:  orbital_radius by planet_type
## t = 9.2206, df = 3171, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Neptune-like and group Super Earth is greater than 0
## 95 percent confidence interval:
##  0.09443775        Inf
## sample estimates:
## mean in group Neptune-like  mean in group Super Earth 
##                  0.2249017                  0.1099521

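The power analysis determined that a minimum of 13 observations per group is required to achieve 95% statistical power at α = 0.05 for the assumed mean difference of 0.5. With 13 observations per group, the achieved power is 0.963, slightly exceeding the target. The analyzed dataset is much larger than this minimum: the Welch t-test reports 3171 degrees of freedom, so the sample size requirement is comfortably met.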
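As a rough cross-check of the sample size calculation, base R's `power.t.test` can solve for the per-group sample size. This is only a sketch: it uses the equal-variance t-test rather than Welch's version, and the standard deviation below is a hypothetical placeholder, since the value of `nasa_sd` is not printed in this report. The `stats::` prefix is needed because `pwrss` masks `power.t.test`.

```r
# Cross-check sketch with base R (equal-variance approximation, not Welch).
assumed_diff <- 0.5   # same assumed mean difference as mu1 in the pwrss call
assumed_sd   <- 0.4   # HYPOTHETICAL placeholder, not the actual nasa_sd

# Solve for n per group given the target power
pw <- stats::power.t.test(
  delta = assumed_diff,
  sd = assumed_sd,
  sig.level = 0.05,
  power = 0.95,
  type = "two.sample",
  alternative = "one.sided"
)

# Sample sizes must be whole numbers, so round up
n_per_group <- ceiling(pw$n)

# Power actually achieved at the rounded-up sample size
achieved <- stats::power.t.test(
  n = n_per_group,
  delta = assumed_diff,
  sd = assumed_sd,
  sig.level = 0.05,
  type = "two.sample",
  alternative = "one.sided"
)$power

print(n_per_group)
print(achieved)
```

Rounding the sample size up is why the achieved power ends slightly above the target, which is the same pattern seen in the pwrss output (0.963 against a target of 0.95).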

At the α = 0.05 significance level, we reject the null hypothesis. The p-value (< 2.2e-16) is substantially smaller than 0.05, indicating strong statistical evidence that Neptune-like planets have a larger average orbital radius than Super Earth planets.

The figure produced by plot(s_n) displays two probability density curves representing the sampling distributions under the null and alternative hypotheses. The small red shaded region under the null distribution corresponds to the Type I error rate (α = 0.05), which is the probability of incorrectly rejecting a true null hypothesis. The large gray shaded area under the alternative distribution represents statistical power (0.963). The minimal overlap between the curves indicates a low Type II error rate (β ≈ 0.04), so the study has high power to detect the assumed effect size.
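The relationship between the two curves can be sketched numerically. The example below is an illustration under stated assumptions, not a reproduction of the pwrss internals: it uses the equal-variance degrees of freedom, and `effect_d` is a hypothetical standardized effect size chosen only for demonstration.

```r
# Sketch: alpha, beta, and power from the central and noncentral t distributions.
alpha       <- 0.05
n_per_group <- 13                    # per-group n from the power analysis output
effect_d    <- 1.4                   # HYPOTHETICAL standardized effect size

df    <- 2 * n_per_group - 2         # equal-variance approximation
ncp   <- effect_d * sqrt(n_per_group / 2)  # noncentrality under the alternative

# Critical value: cuts off the upper alpha tail of the null distribution
t_crit <- qt(1 - alpha, df)

# Beta: area of the alternative distribution below the critical value
beta  <- pt(t_crit, df, ncp = ncp)

# Power: area of the alternative distribution above the critical value
power <- 1 - beta

print(c(t_crit = t_crit, beta = beta, power = power))
```

The critical value marks the boundary of the red region in the figure; everything to its right under the alternative curve is power, and everything to its left is β.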

Hypothesis Two

Research question: Do Neptune-like planets have more elliptical orbits than Super Earth planets?

Null Hypothesis: There is no difference in mean orbital eccentricity between Neptune-like and Super Earth planets.

Alternative Hypothesis: Neptune-like planets have higher mean orbital eccentricity than Super Earth planets.

ecc_fd <- nasa_data |> 
  filter(planet_type %in% c("Neptune-like", "Super Earth")) |> 
  drop_na(eccentricity) 


ecc_summary <- ecc_fd |> 
  group_by(planet_type) |> 
  summarise(
    mean_orbit = mean(eccentricity, na.rm = TRUE),
    sd_orbit = sd(eccentricity, na.rm = TRUE),
    n = n()
  )

print(ecc_summary)
## # A tibble: 2 × 4
##   planet_type  mean_orbit sd_orbit     n
##   <chr>             <dbl>    <dbl> <int>
## 1 Neptune-like     0.0330   0.0926  1825
## 2 Super Earth      0.0172   0.0620  1595
ecc_fd$planet_type <- factor(ecc_fd$planet_type, levels = c("Neptune-like", "Super Earth"))

t_test_results <- t.test(
  eccentricity ~ planet_type,
  data = ecc_fd,
  alternative = "greater",
  var.equal = FALSE
)

t_test_results
## 
##  Welch Two Sample t-test
## 
## data:  eccentricity by planet_type
## t = 5.9425, df = 3209.5, p-value = 1.554e-09
## alternative hypothesis: true difference in means between group Neptune-like and group Super Earth is greater than 0
## 95 percent confidence interval:
##  0.01146117        Inf
## sample estimates:
## mean in group Neptune-like  mean in group Super Earth 
##                 0.03300422                 0.01715473

A Welch two-sample t-test compared mean orbital eccentricity between Neptune-like and Super Earth planets. The p-value (1.554e-09) is much smaller than 0.05, so at the α = 0.05 significance level we reject the null hypothesis. Neptune-like planets have a higher mean eccentricity (0.0330) than Super Earth planets (0.0172).
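The Welch statistic can be recomputed by hand from the summary table as a sanity check. The sketch below uses the rounded means, standard deviations, and group sizes printed in `ecc_summary`, so small differences from the reported t = 5.9425 and df = 3209.5 are expected.

```r
# Recompute Welch's t and degrees of freedom from the rounded summary values.
m1 <- 0.0330; s1 <- 0.0926; n1 <- 1825   # Neptune-like
m2 <- 0.0172; s2 <- 0.0620; n2 <- 1595   # Super Earth

# Squared standard error contributed by each group
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch t statistic: difference in means over the unpooled standard error
t_stat <- (m1 - m2) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# One-sided p-value for the alternative "greater"
p_value <- pt(t_stat, welch_df, lower.tail = FALSE)

print(c(t = t_stat, df = welch_df, p = p_value))
```

Because the standard errors are not pooled, the larger variance in the Neptune-like group is carried through directly, which is why `var.equal = FALSE` is the safer choice here.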

nasa_data |> 
  filter(planet_type %in% c("Neptune-like", "Super Earth")) |> 
  drop_na(eccentricity) |>                     
  ggplot() +
  geom_boxplot(
    mapping = aes(
      x = factor(planet_type, levels = c("Super Earth", "Neptune-like")),
      y = eccentricity
    ),
    notch = TRUE,
    fill = "skyblue",
    outlier.alpha = 0.2
  ) +
  labs(
    title = "Orbital Eccentricity of Neptune-like vs Super Earth Planets",
    x = "Planet Type",
    y = "Eccentricity"
  ) +
  theme_minimal()

It is difficult to distinguish the groups visually because most planets in this dataset have very low eccentricities, which compresses the boxes near the bottom of the scale. However, the underlying data confirm a clear trend: Neptune-like planets have a higher mean eccentricity (0.0330) than Super Earth planets (0.0172).