Data Dive Eight

Regression Modeling

Load library

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)
library(pwrss)

## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

Load NASA data

nasa_data <- read_delim("C:/Users/imaya/Downloads/cleaned_5250.csv",delim = ",")

## Rows: 5250 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, planet_type, mass_wrt, radius_wrt, detection_method
## dbl (8): distance, stellar_magnitude, discovery_year, mass_multiplier, radiu...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(nasa_data)

## # A tibble: 6 × 13
##   name     distance stellar_magnitude planet_type discovery_year mass_multiplier
##   <chr>       <dbl>             <dbl> <chr>                <dbl>           <dbl>
## 1 11 Coma…      304              4.72 Gas Giant             2007           19.4 
## 2 11 Ursa…      409              5.01 Gas Giant             2009           14.7 
## 3 14 Andr…      246              5.23 Gas Giant             2008            4.8 
## 4 14 Herc…       58              6.62 Gas Giant             2002            8.14
## 5 16 Cygn…       69              6.22 Gas Giant             1996            1.78
## 6 17 Scor…      408              5.23 Gas Giant             2020            4.32
## # ℹ 7 more variables: mass_wrt <chr>, radius_multiplier <dbl>,
## #   radius_wrt <chr>, orbital_radius <dbl>, orbital_period <dbl>,
## #   eccentricity <dbl>, detection_method <chr>

Notes: For this analysis we will examine the variables mass_multiplier and mass_wrt. The mass_multiplier represents the numerical value of the planet’s mass, while mass_wrt indicates the unit the mass is measured relative to (either Jupiter or Earth). The planet’s total mass is therefore interpreted as the multiplier times the reference unit.

To make comparisons easier across planets, all masses were standardized to Jupiter masses, because some planets in the dataset are measured relative to Earth’s mass and others relative to Jupiter’s mass. One Jupiter mass is about 317.8 Earth masses, so a mass given relative to Earth is converted to Jupiter masses by dividing by 317.8.

Source: https://www.unitsconverters.com/en/Jupitermass-To-Massofearth/Unittounit-6003-173

nasa_data$mass_jupiter <- ifelse(nasa_data$mass_wrt=="Jupiter",
                           nasa_data$mass_multiplier,
                           nasa_data$mass_multiplier / 317.8)
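As a quick sanity check of the conversion, the sketch below applies the same rule to a small hypothetical data frame (the `toy` rows are made up for illustration and are not taken from the dataset). It uses the 317.8 divisor from the code above. Unlike the `ifelse()` call, it leaves rows whose reference unit is neither Jupiter nor Earth as NA instead of treating them as Earth masses; that handling is an assumption, not something the dataset is known to require.

```r
# Hypothetical toy rows; not real values from cleaned_5250.csv
toy <- data.frame(
  mass_multiplier = c(2.0, 317.8, 5.0),
  mass_wrt        = c("Jupiter", "Earth", NA),
  stringsAsFactors = FALSE
)

earth_per_jupiter <- 317.8  # approximate Earth masses in one Jupiter mass

# Convert explicitly per unit; any other or missing unit stays NA
toy$mass_jupiter <- NA_real_
is_jup   <- !is.na(toy$mass_wrt) & toy$mass_wrt == "Jupiter"
is_earth <- !is.na(toy$mass_wrt) & toy$mass_wrt == "Earth"
toy$mass_jupiter[is_jup]   <- toy$mass_multiplier[is_jup]
toy$mass_jupiter[is_earth] <- toy$mass_multiplier[is_earth] / earth_per_jupiter

print(toy)
```

Running `table(nasa_data$mass_wrt, useNA = "ifany")` on the real data would show whether any units other than Jupiter and Earth are present.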

ANOVA Testing

Null hypothesis (H0): Detection method has no effect on average planet mass (all group means are equal).
Alternative hypothesis (H1): At least one detection method has a different average planet mass.

This ANOVA test is used to determine whether the method used to detect a planet is associated with differences in its average mass.

anova_model <- aov(mass_jupiter ~ detection_method, data = nasa_data)
summary(anova_model)

##                    Df Sum Sq Mean Sq F value Pr(>F)    
## detection_method   10  40824    4082   30.84 <2e-16 ***
## Residuals        5216 690362     132                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 23 observations deleted due to missingness

The p-value in the ANOVA summary is less than 2 × 10⁻¹⁶, far below the significance level of α = 0.05, so we reject the null hypothesis: at least one detection method has a different mean planet mass.
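A very small p-value with more than 5,000 observations does not by itself say how large the differences are. As a rough sketch, the F statistic and the effect size eta squared can be recomputed from the sums of squares printed in the ANOVA table above (the variable names below are illustrative).

```r
# Values copied from the summary(anova_model) output
ss_method   <- 40824    # Sum Sq for detection_method
ss_residual <- 690362   # Sum Sq for Residuals
df_method   <- 10
df_residual <- 5216

# F statistic = mean square between groups / mean square within groups
f_value <- (ss_method / df_method) / (ss_residual / df_residual)

# Eta squared = share of total variation attributed to detection method
eta_sq <- ss_method / (ss_method + ss_residual)

round(c(F = f_value, eta_squared = eta_sq), 4)
```

With these numbers, detection method accounts for roughly 5.6% of the variation in mass, so the group differences are statistically clear but most of the variation lies within methods.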

This suggests that different detection methods are associated with different planet characteristics. It is possible that certain methods are more likely to detect planets of particular masses. Additionally, the time period in which a detection method was developed may influence the types of planets discovered. For example, in the early years of planet detection, higher-mass planets were easier to detect than lower-mass planets.

Some detection methods may also be better suited for discovering higher-mass planets. For instance, the transit method detects planets by observing dips in a star’s brightness, while the astrometry method detects the wobble of a star caused by an orbiting planet. In general, larger or more massive planets produce stronger signals, making them easier to detect with these methods.

In contrast, methods such as the radial velocity method may be more sensitive to smaller-mass planets, as they measure the motion of a star caused by the gravitational pull of orbiting planets. With modern technology, this method can detect smaller-mass planets, although larger planets still produce stronger signals. These differences in detection sensitivity across methods may have contributed to rejecting the null hypothesis, as each method has its own strengths.

par(mar = c(5, 4, 4, 8), xpd = TRUE)

nasa_data$detection_method_factor <- factor(nasa_data$detection_method)

stripchart(mass_jupiter ~ detection_method_factor,
           data = nasa_data,
           method = "jitter",
           pch = 19,
           col = "skyblue",
           vertical = TRUE,
           ylim = c(0, 200),
           main = "Planet Mass by Detection Method",
           xlab = "Detection Method (Numbered)",
           ylab = "Mass (Jupiter Masses)",
           xaxt = "n")  

axis(1, at = seq_along(levels(nasa_data$detection_method_factor)),
     labels = seq_along(levels(nasa_data$detection_method_factor)))

legend("topright",
       inset = c(-0.33, 0),
       legend = paste(seq_along(levels(nasa_data$detection_method_factor)),
                      levels(nasa_data$detection_method_factor),
                      sep = " = "),
       cex = 0.5)

aggregate(mass_jupiter ~ detection_method, data = nasa_data, mean)

##                 detection_method mass_jupiter
## 1                     Astrometry   15.3800000
## 2                 Direct Imaging   24.9369472
## 3                Disk Kinematics    2.5000000
## 4      Eclipse Timing Variations    6.7760176
## 5     Gravitational Microlensing    2.3483584
## 6  Orbital Brightness Modulation    1.1022409
## 7                  Pulsar Timing    0.6467092
## 8    Pulsation Timing Variations    7.5000000
## 9                Radial Velocity    3.2745846
## 10                       Transit    0.5427572
## 11     Transit Timing Variations    1.4515515

# Mean mass per detection method; na.rm = TRUE skips planets with missing mass
avg_mass <- tapply(nasa_data$mass_jupiter,
                   nasa_data$detection_method,
                   mean,
                   na.rm = TRUE)

boxplot(mass_jupiter ~ detection_method_factor,
        data = nasa_data,
        col = "skyblue",
        main = "Planet Mass by Detection Method",
        ylab = "Mass (Jupiter Masses)",
        ylim = c(0, 50),
        xaxt = "n") 


axis(1,
     at = seq_along(levels(nasa_data$detection_method_factor)),
     labels = seq_along(levels(nasa_data$detection_method_factor)),
     las = 1,
     cex.axis = 0.8)

legend("topright",
       inset = c(-0.32, 0),
       legend = paste(seq_along(levels(nasa_data$detection_method_factor)),
                      levels(nasa_data$detection_method_factor), sep = " = "),
       cex = 0.5,
       bty = "n")

The boxplot and strip chart help visualize how planet mass is distributed within each detection method. The strip chart also shows roughly how many planets each method has detected, which highlights a potential source of bias in the ANOVA: some detection methods, such as astrometry, have relatively few detected planets compared to others.

The boxplot focuses on the distribution of planet masses for each method. It shows that the typical mass (the median line in each box) and the spread differ between detection methods, with some methods tending to detect lower-mass planets and others higher-mass planets. This is consistent with the ANOVA result, which rejected the null hypothesis of equal mean planet mass across detection methods.
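Because group sizes are so uneven and the masses are heavily right-skewed, it can help to count observations per group and to run tests that do not assume equal variances or normality. The sketch below uses simulated data (the `sim` data frame and its group labels A, B, C are hypothetical stand-ins for `nasa_data`, not real results); `oneway.test()` and `kruskal.test()` are base R functions from the stats package.

```r
set.seed(42)

# Simulated, unbalanced, right-skewed groups (not the real data)
sim <- data.frame(
  method = factor(rep(c("A", "B", "C"), times = c(200, 30, 5))),
  mass   = c(rlnorm(200, meanlog = -1),
             rlnorm(30,  meanlog = 1),
             rlnorm(5,   meanlog = 2))
)

# Number of observations per group
group_n <- table(sim$method)
print(group_n)

# Welch's one-way test: does not assume equal variances across groups
welch <- oneway.test(mass ~ method, data = sim, var.equal = FALSE)

# Kruskal-Wallis test: rank-based, less sensitive to extreme values
kw <- kruskal.test(mass ~ method, data = sim)

c(welch_p = welch$p.value, kruskal_p = kw$p.value)
```

On the real data the same calls would take `mass_jupiter ~ detection_method` and `data = nasa_data`, run before `nasa_data` is reduced to the two regression columns.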

Linear Regression Model

I built a linear regression model to estimate how planetary mass (in Jupiter units) is associated with orbital radius.

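# Keep only the two modeling columns and drop rows with missing values.
# Note: this overwrites nasa_data, so the other columns are unavailable from here on.
nasa_data <- na.omit(nasa_data[, c("mass_jupiter", "orbital_radius")])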

nasa_data |>
  ggplot(aes(x = orbital_radius, y = mass_jupiter)) +
  geom_point(size = 2) +
  ylim(0, 150) +
  geom_smooth(method = "lm", se = FALSE, color = "skyblue")

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

The fitted line slopes slightly upward, suggesting a weak positive association between planet mass (mass_jupiter) and orbital radius. However, most of the data points are clustered in the lower-left corner of the plot, with planet masses below 50 Jupiter masses. The line may therefore be driven by a small number of distant or massive planets, and it does not represent the majority of planets well.
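When most points are compressed into one corner of a plot, a common exploratory step is to refit on a log scale. The sketch below uses simulated positive values (the `sim` data frame is hypothetical, not the exoplanet data) only to show the mechanics; whether a log transform suits the real dataset is an assumption that would need checking.

```r
set.seed(1)

# Simulated right-skewed, strictly positive values (not real data)
sim <- data.frame(orbital_radius = rlnorm(500, meanlog = 0, sdlog = 1.5))
sim$mass_jupiter <- exp(0.3 * log(sim$orbital_radius) + rnorm(500))

# Fit on the raw scale and on the log10-log10 scale
fit_raw <- lm(mass_jupiter ~ orbital_radius, data = sim)
fit_log <- lm(log10(mass_jupiter) ~ log10(orbital_radius), data = sim)

# R-squared of each fit on its own scale
# (values on different response scales are not directly comparable)
c(raw_r2 = summary(fit_raw)$r.squared,
  log_r2 = summary(fit_log)$r.squared)

# Slope on the log-log scale: estimated power-law exponent
coef(fit_log)[2]
```

A log transform requires strictly positive values, so any zero masses or radii in the real data would have to be handled first.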

model <- lm(mass_jupiter ~ orbital_radius, data = nasa_data)
model$coefficients

##    (Intercept) orbital_radius 
##    1.468470125    0.003873673

summary(model)

## 
## Call:
## lm(formula = mass_jupiter ~ orbital_radius, data = nasa_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.24  -1.46  -1.44  -0.94 750.33 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.468470   0.172875   8.494  < 2e-16 ***
## orbital_radius 0.003874   0.001243   3.117  0.00184 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.14 on 4941 degrees of freedom
## Multiple R-squared:  0.001962,   Adjusted R-squared:  0.00176 
## F-statistic: 9.713 on 1 and 4941 DF,  p-value: 0.00184

The slope is very small (0.003874): each one-unit increase in orbital radius is associated with an increase of only about 0.004 Jupiter masses. Although the slope is statistically significant (p = 0.00184), R² is 0.001962, so orbital radius explains only about 0.2% of the variation in planet mass and is a poor predictor of it. The residuals, which range from -24.24 to 750.33, also show that a few extremely massive planets are fit very badly. Interestingly, one might expect a planet’s mass to be related to its distance from the star, since gravitational pull depends on mass. However, no such relationship appears in the data.
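A significant p-value alongside a near-zero R² largely reflects the sample size. As a worked check, the figures printed in the model summary above can be tied together: for a regression with one predictor, F equals t squared, and R² equals F / (F + residual degrees of freedom). The variable names below are illustrative.

```r
# Values copied from the summary(model) output
estimate  <- 0.003874   # slope for orbital_radius
std_error <- 0.001243   # its standard error
df_resid  <- 4941       # residual degrees of freedom
f_stat    <- 9.713      # F-statistic

t_value <- estimate / std_error          # slope divided by its standard error
r2      <- f_stat / (f_stat + df_resid)  # R-squared implied by F and residual df

round(c(t = t_value, t_squared = t_value^2, r_squared = r2), 6)
```

The recomputed values agree with the printed t value (3.117), F-statistic (9.713), and R² (0.001962) up to rounding. With nearly 5,000 observations, even a slope this weak is distinguishable from zero.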

2026-02-17