Data Dive Eight

Regression Modeling

Load libraries

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(pwrss)
## Warning: package 'pwrss' was built under R version 4.5.2
## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

Load NASA data

nasa_data <- read_delim("C:/Users/imaya/Downloads/cleaned_5250.csv", delim = ",")
## Rows: 5250 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, planet_type, mass_wrt, radius_wrt, detection_method
## dbl (8): distance, stellar_magnitude, discovery_year, mass_multiplier, radiu...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nasa_data)
## # A tibble: 6 × 13
##   name     distance stellar_magnitude planet_type discovery_year mass_multiplier
##   <chr>       <dbl>             <dbl> <chr>                <dbl>           <dbl>
## 1 11 Coma…      304              4.72 Gas Giant             2007           19.4 
## 2 11 Ursa…      409              5.01 Gas Giant             2009           14.7 
## 3 14 Andr…      246              5.23 Gas Giant             2008            4.8 
## 4 14 Herc…       58              6.62 Gas Giant             2002            8.14
## 5 16 Cygn…       69              6.22 Gas Giant             1996            1.78
## 6 17 Scor…      408              5.23 Gas Giant             2020            4.32
## # ℹ 7 more variables: mass_wrt <chr>, radius_multiplier <dbl>,
## #   radius_wrt <chr>, orbital_radius <dbl>, orbital_period <dbl>,
## #   eccentricity <dbl>, detection_method <chr>

Notes: For this analysis we will examine the variables mass_multiplier and mass_wrt. The mass_multiplier represents the numerical value of the planet’s mass, while mass_wrt indicates the unit the mass is measured relative to (either Jupiter or Earth). The planet’s total mass is therefore interpreted as the multiplier times the reference unit.

To make comparisons easier across planets, all masses are standardized to Jupiter masses, because some planets in the dataset are measured relative to Earth’s mass and others relative to Jupiter’s mass. One Jupiter mass is about 317.8 Earth masses, so a mass given relative to Earth is converted to Jupiter masses by dividing by 317.8.

Source: https://www.unitsconverters.com/en/Jupitermass-To-Massofearth/Unittounit-6003-173
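
As a minimal sketch of the conversion described above, the helper below turns a multiplier and its reference unit into Jupiter masses. The function name `to_jupiter_mass`, the constant name `earth_per_jupiter`, and the example values are hypothetical and not part of the original analysis; the factor 317.8 is the rounded value used in this report's conversion code.

```r
# Hypothetical helper (not part of the original analysis): convert a mass
# given relative to Earth or Jupiter into Jupiter masses.
earth_per_jupiter <- 317.8  # approximate Earth masses in one Jupiter mass

to_jupiter_mass <- function(multiplier, wrt) {
  ifelse(wrt == "Jupiter", multiplier,
         ifelse(wrt == "Earth", multiplier / earth_per_jupiter, NA_real_))
}

# Illustrative values only, not rows from the dataset
to_jupiter_mass(c(2, 317.8, 10), c("Jupiter", "Earth", "Earth"))
```

Unlike a two-way `ifelse()`, this version returns `NA` for any unit other than Earth or Jupiter instead of silently treating it as Earth-relative.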

lm_model <- lm(mass_multiplier ~ orbital_radius, data = nasa_data)

summary(lm_model)
## 
## Call:
## lm(formula = mass_multiplier ~ orbital_radius, data = nasa_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -16.26  -4.65  -2.28   1.62 745.47 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.411902   0.188514  34.013   <2e-16 ***
## orbital_radius 0.002151   0.001355   1.587    0.113    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.24 on 4941 degrees of freedom
##   (307 observations deleted due to missingness)
## Multiple R-squared:  0.0005096,  Adjusted R-squared:  0.0003073 
## F-statistic: 2.519 on 1 and 4941 DF,  p-value: 0.1125
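# Standardize every mass to Jupiter masses; rows not labeled "Jupiter"
# are treated as Earth-relative and divided by 317.8
nasa_data$mass_jupiter <- ifelse(nasa_data$mass_wrt == "Jupiter",
                                 nasa_data$mass_multiplier,
                                 nasa_data$mass_multiplier / 317.8)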
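
The regression above uses mass_multiplier directly, so Earth-relative and Jupiter-relative values enter the model on different scales. A hedged sketch of the same model form fitted to the standardized mass follows. It runs on a small invented data frame (`toy`, with made-up values that are not from the NASA dataset), so its coefficients say nothing about the real data; the name `lm_std` is also hypothetical.

```r
# Invented stand-in for nasa_data; all values are made up for illustration
toy <- data.frame(
  mass_multiplier = c(1.2, 5.0, 300, 0.8, 12, 45, 2.5, 7.1),
  mass_wrt        = c("Jupiter", "Earth", "Earth", "Jupiter",
                      "Earth", "Earth", "Jupiter", "Earth"),
  orbital_radius  = c(1.1, 0.05, 2.3, 3.0, 0.1, 0.4, 5.2, 0.07)
)

# Standardize to Jupiter masses before modeling (same rule as above)
toy$mass_jupiter <- ifelse(toy$mass_wrt == "Jupiter",
                           toy$mass_multiplier,
                           toy$mass_multiplier / 317.8)

# Same model form as lm_model, but with the standardized response
lm_std <- lm(mass_jupiter ~ orbital_radius, data = toy)
coef(lm_std)
```

On the full data, this would mean creating mass_jupiter before the call to `lm()` and using mass_jupiter in place of mass_multiplier in the formula.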


anova_model <- aov(mass_jupiter ~ detection_method, data = nasa_data)
summary(anova_model)
##                    Df Sum Sq Mean Sq F value Pr(>F)    
## detection_method   10  40824    4082   30.84 <2e-16 ***
## Residuals        5216 690362     132                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 23 observations deleted due to missingness
boxplot(mass_jupiter ~ detection_method,
        data = nasa_data,
        col = "lightblue",
        main = "Planet Mass by Detection Method",
        xlab = "Detection Method",
        ylab = "Mass (Jupiter Masses)")
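
The F value in the ANOVA table can be reproduced by hand, which shows what the test compares: variation between detection methods relative to variation within them. The sketch below copies the sums of squares and degrees of freedom from the table; the variable names are chosen for illustration.

```r
# Values copied from the ANOVA table in this report
ss_between <- 40824;  df_between <- 10
ss_within  <- 690362; df_within  <- 5216

ms_between <- ss_between / df_between  # mean square for detection_method
ms_within  <- ss_within / df_within    # residual mean square

# F is the ratio of the two mean squares
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(c(ms_between = ms_between, ms_within = ms_within, F = f_value), 2)
p_value
```

Because the table's sums of squares are rounded, the result should agree with the reported F value of 30.84 only up to rounding.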

ecc_fd <- nasa_data |> 
  filter(planet_type %in% c("Neptune-like", "Super Earth")) |> 
  drop_na(eccentricity) 


ecc_summary <- ecc_fd |> 
  group_by(planet_type) |> 
  summarise(
    mean_orbit = mean(eccentricity, na.rm = TRUE),
    sd_orbit = sd(eccentricity, na.rm = TRUE),
    n = n()
  )

print(ecc_summary)
## # A tibble: 2 × 4
##   planet_type  mean_orbit sd_orbit     n
##   <chr>             <dbl>    <dbl> <int>
## 1 Neptune-like     0.0330   0.0926  1825
## 2 Super Earth      0.0172   0.0620  1595
ecc_fd$planet_type <- factor(ecc_fd$planet_type, levels = c("Neptune-like", "Super Earth"))

t_test_results <- t.test(
  eccentricity ~ planet_type,
  data = ecc_fd,
  alternative = "greater",
  var.equal = FALSE
)

t_test_results
## 
##  Welch Two Sample t-test
## 
## data:  eccentricity by planet_type
## t = 5.9425, df = 3209.5, p-value = 1.554e-09
## alternative hypothesis: true difference in means between group Neptune-like and group Super Earth is greater than 0
## 95 percent confidence interval:
##  0.01146117        Inf
## sample estimates:
## mean in group Neptune-like  mean in group Super Earth 
##                 0.03300422                 0.01715473

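A one-sided Welch two-sample t-test compared mean orbital eccentricity between Neptune-like and Super Earth planets, with the alternative hypothesis that the Neptune-like mean is greater. The p-value (1.554e-09; t = 5.9425, df = 3209.5) is much smaller than 0.05, providing strong evidence against the null hypothesis that the Neptune-like mean is not greater. Neptune-like planets have a higher mean eccentricity (0.0330) than Super Earth planets (0.0172), a difference of about 0.016.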
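
To show where the Welch statistic comes from, the sketch below recomputes it from the group summaries printed earlier in this report. The means are taken from the t-test output and the standard deviations from the rounded summary table, so the results should land close to, but not exactly on, the reported t = 5.9425 and df = 3209.5. Variable names are chosen for illustration.

```r
# Summary statistics as printed in this report (standard deviations rounded)
m1 <- 0.03300422; s1 <- 0.0926; n1 <- 1825  # Neptune-like
m2 <- 0.01715473; s2 <- 0.0620; n2 <- 1595  # Super Earth

# Squared standard error of each group mean
se2_1 <- s1^2 / n1
se2_2 <- s2^2 / n2

# Welch t statistic: difference in means over the combined standard error
t_stat <- (m1 - m2) / sqrt(se2_1 + se2_2)

# Welch-Satterthwaite degrees of freedom
df_welch <- (se2_1 + se2_2)^2 /
  (se2_1^2 / (n1 - 1) + se2_2^2 / (n2 - 1))

# One-sided p-value for the alternative "Neptune-like mean is greater"
p_one_sided <- pt(t_stat, df_welch, lower.tail = FALSE)

c(t = t_stat, df = df_welch, p = p_one_sided)
```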

nasa_data |> 
  filter(planet_type %in% c("Neptune-like", "Super Earth")) |> 
  drop_na(eccentricity) |>                     
  ggplot() +
  geom_boxplot(
    mapping = aes(
      x = factor(planet_type, levels = c("Super Earth", "Neptune-like")),
      y = eccentricity
    ),
    notch = TRUE,
    fill = "skyblue",
    outlier.alpha = 0.2
  ) +
  labs(
    title = "Orbital Eccentricity of Neptune-like vs Super Earth Planets",
    x = "Planet Type",
    y = "Eccentricity"
  ) +
  theme_minimal()

It is difficult to distinguish the two groups visually because most planets in this dataset have very low eccentricities, so both boxes sit near the bottom of the scale. However, the summary statistics show a clear difference: Neptune-like planets have a higher mean eccentricity (0.0330) than Super Earth planets (0.0172).
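
One possible way to make the boxes easier to compare is to zoom the y-axis with `coord_cartesian()`, which changes the visible range without dropping points from the box statistics. The sketch below is only an illustration: it uses simulated eccentricities (not the NASA data), and the zoom limit of 0.15 is an assumed value that would need adjusting for the real distribution.

```r
library(ggplot2)

# Simulated eccentricities, skewed toward zero; NOT the NASA data
set.seed(42)
toy <- data.frame(
  planet_type  = rep(c("Super Earth", "Neptune-like"), each = 200),
  eccentricity = pmin(c(rexp(200, rate = 50), rexp(200, rate = 30)), 0.9)
)

p <- ggplot(toy, aes(x = planet_type, y = eccentricity)) +
  geom_boxplot(fill = "skyblue", outlier.alpha = 0.2) +
  # Zoom the visible range; all points still feed the box statistics
  coord_cartesian(ylim = c(0, 0.15)) +
  labs(x = "Planet Type", y = "Eccentricity") +
  theme_minimal()

p
```

Setting limits through `scale_y_continuous(limits = ...)` instead would remove out-of-range points before the boxes are computed, which is why the sketch zooms with `coord_cartesian()`.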