Data Dive Eleven

GLMs Part 2

Load library

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(lindia)
library(broom)
library(ggrepel)
library(ggthemes)
library(effsize)
library(GGally)
library(ggplot2)

Load NASA data

nasa_data <- read_delim("C:/Users/imaya/Downloads/cleaned_5250.csv", delim = ",")

## Rows: 5250 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, planet_type, mass_wrt, radius_wrt, detection_method
## dbl (8): distance, stellar_magnitude, discovery_year, mass_multiplier, radiu...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(nasa_data)

## # A tibble: 6 × 13
##   name     distance stellar_magnitude planet_type discovery_year mass_multiplier
##   <chr>       <dbl>             <dbl> <chr>                <dbl>           <dbl>
## 1 11 Coma…      304              4.72 Gas Giant             2007           19.4 
## 2 11 Ursa…      409              5.01 Gas Giant             2009           14.7 
## 3 14 Andr…      246              5.23 Gas Giant             2008            4.8 
## 4 14 Herc…       58              6.62 Gas Giant             2002            8.14
## 5 16 Cygn…       69              6.22 Gas Giant             1996            1.78
## 6 17 Scor…      408              5.23 Gas Giant             2020            4.32
## # ℹ 7 more variables: mass_wrt <chr>, radius_multiplier <dbl>,
## #   radius_wrt <chr>, orbital_radius <dbl>, orbital_period <dbl>,
## #   eccentricity <dbl>, detection_method <chr>

Linear Model

I developed a linear model to examine the relationship between stellar magnitude and distance, while controlling for discovery year and eccentricity. The reasoning is that detection methods have improved over time, and different eras relied on different technologies, which directly affect how distant or faint a planet can be and still be observed. I also included orbital eccentricity in the model. Planets with highly elliptical orbits experience changes in both distance and brightness relative to their star, depending on their position at the time of observation. Accounting for eccentricity helps reduce noise in the data that may arise from variations in orbital position during detection.

model <- lm(distance ~ stellar_magnitude + discovery_year + eccentricity, data = nasa_data)
summary(model)

## 
## Call:
## lm(formula = distance ~ stellar_magnitude + discovery_year + 
##     eccentricity, data = nasa_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4759.7  -747.6  -128.4   567.5 23330.3 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       83841.325   9521.926   8.805   <2e-16 ***
## stellar_magnitude   360.777      7.244  49.803   <2e-16 ***
## discovery_year      -42.985      4.731  -9.086   <2e-16 ***
## eccentricity       -239.598    155.412  -1.542    0.123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1397 on 5073 degrees of freedom
##   (173 observations deleted due to missingness)
## Multiple R-squared:  0.3877, Adjusted R-squared:  0.3874 
## F-statistic:  1071 on 3 and 5073 DF,  p-value: < 2.2e-16

coef(summary(model))

##                      Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)       83841.32530 9521.926348  8.805080 1.766424e-18
## stellar_magnitude   360.77681    7.244121 49.802703 0.000000e+00
## discovery_year      -42.98468    4.730974 -9.085799 1.446625e-19
## eccentricity       -239.59773  155.412001 -1.541694 1.232104e-01

The linear regression shows that stellar magnitude has a positive coefficient (360.777). Because higher stellar magnitude values correspond to dimmer stars, systems that appear dimmer are predicted to be at greater distances: each one-unit increase in magnitude is associated with a predicted distance about 361 units greater, holding discovery year and eccentricity constant. This aligns with the physical relationship between distance and apparent brightness, though it may also reflect observational limitations in detecting distant systems.
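The physical relationship referred to here is the distance modulus, m - M = 5 log10(d / 10 pc). The sketch below is only an illustration: it assumes that stellar_magnitude is an apparent magnitude and that every star shares one absolute magnitude, which is not a column in this dataset. `distance_from_magnitude` and `M_assumed` are hypothetical names. It shows that distance grows multiplicatively with magnitude rather than linearly, which may be one reason a straight-line fit struggles at large distances.

```r
# Distance modulus: m - M = 5 * log10(d / 10), with d in parsecs.
# Solving for d gives d = 10^((m - M + 5) / 5).
distance_from_magnitude <- function(m, M) {
  10^((m - M + 5) / 5)
}

M_assumed <- 5             # assumed absolute magnitude (illustrative only)
m <- c(5, 10, 15)          # apparent magnitudes, 5 magnitudes apart

d_pc <- distance_from_magnitude(m, M_assumed)
d_pc                       # 10 100 1000 parsecs

# Ratio of consecutive distances: each 5-magnitude step multiplies distance by 10
step_ratio <- d_pc[-1] / d_pc[-length(d_pc)]
step_ratio
```

Under these assumptions log10(distance) is linear in magnitude, so a model of log(distance) could be worth comparing against the untransformed fit.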

Discovery year has a negative coefficient (-42.985), indicating that more recently discovered planets in this dataset tend to be closer rather than farther away. This contrasts with the expectation that improved detection methods would enable the discovery of more distant systems and may reflect sampling or observational biases in the dataset.

Eccentricity has a negative coefficient (-239.598), but with a p-value of 0.123, it is not statistically significant. This suggests that orbital shape does not have a reliable linear effect on distance in this model.

exp(coef(model))

##       (Intercept) stellar_magnitude    discovery_year      eccentricity 
##               Inf     4.823680e+156      2.147782e-19     8.790758e-105

Exponentiated coefficients can be read as odds ratios only for a logistic regression. Here, model is an ordinary linear regression fit with lm(), so the values above are not odds ratios and say nothing about the odds of detection. Their extreme size is an arithmetic consequence of exponentiating coefficients that are measured in distance units: the intercept (83841.325) overflows to Inf, stellar magnitude (360.777) gives approximately 4.82e+156, discovery year (-42.985) gives approximately 2.15e-19, and eccentricity (-239.598) gives approximately 8.79e-105. None of these values indicates separation, numerical instability, or a scaling problem in the fit. The coefficients should instead be interpreted on their original additive scale, as in the three paragraphs above.
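The exponentiated values printed above can be reproduced as plain arithmetic on the coefficients, with no model object involved. The sketch below copies the four estimates from the summary output; the names `b` and `exp_b` are illustrative. It is meant to show that the Inf and the very large and very small numbers follow directly from exp() applied to large-magnitude inputs.

```r
# Coefficient estimates copied from coef(summary(model)) above
b <- c(
  intercept         = 83841.32530,
  stellar_magnitude = 360.77681,
  discovery_year    = -42.98468,
  eccentricity      = -239.59773
)

# exp() of a large positive number exceeds the largest double and becomes Inf;
# exp() of a large negative number is a tiny positive number
exp_b <- exp(b)
exp_b

# For a linear model the additive reading is the meaningful one:
# predicted change in distance for a one-unit increase in stellar magnitude
change_per_magnitude <- unname(b["stellar_magnitude"]) * 1
change_per_magnitude
```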

Since eccentricity is not statistically significant in the model (p = 0.123), it adds little explanatory power and can reasonably be removed to simplify the model. Discovery year, however, remains highly significant (p < 2e-16), indicating that it contributes important information, so it should be retained. Its coefficient looks small only because it is expressed per calendar year.

model2 <- lm(distance ~ stellar_magnitude + discovery_year, data = nasa_data)
summary(model2)

## 
## Call:
## lm(formula = distance ~ stellar_magnitude + discovery_year, data = nasa_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4776.5  -749.7  -130.6   574.0 23319.5 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       82881.097   9502.823   8.722   <2e-16 ***
## stellar_magnitude   365.807      6.469  56.552   <2e-16 ***
## discovery_year      -42.548      4.723  -9.008   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1397 on 5074 degrees of freedom
##   (173 observations deleted due to missingness)
## Multiple R-squared:  0.3875, Adjusted R-squared:  0.3872 
## F-statistic:  1605 on 2 and 5074 DF,  p-value: < 2.2e-16
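One way to check that dropping a predictor costs little is a nested-model F-test alongside a comparison of adjusted R-squared. The sketch below uses simulated data, not the NASA file, and the data frame `sim` and its coefficients are made up for illustration. It reuses the column names from the models above so the same calls could be tried on `model` and `model2`.

```r
set.seed(42)
n <- 300

# Simulated stand-in data (illustrative only, not the NASA dataset)
sim <- data.frame(
  stellar_magnitude = rnorm(n, mean = 12, sd = 3),
  discovery_year    = sample(1995:2023, n, replace = TRUE),
  eccentricity      = runif(n, min = 0, max = 0.9)
)
# In this simulation distance does not depend on eccentricity
sim$distance <- 400 * sim$stellar_magnitude -
  40 * (sim$discovery_year - 2000) +
  rnorm(n, sd = 1000)

full    <- lm(distance ~ stellar_magnitude + discovery_year + eccentricity, data = sim)
reduced <- lm(distance ~ stellar_magnitude + discovery_year, data = sim)

# Nested F-test: does adding eccentricity reduce the residual sum of squares?
cmp <- anova(reduced, full)
f_p <- cmp[["Pr(>F)"]][2]

# With a single dropped term, this p-value matches the t-test in summary(full)
t_p <- coef(summary(full))["eccentricity", "Pr(>|t|)"]

adj_r2 <- c(
  full    = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared
)
adj_r2
```

In the output above, adjusted R-squared moves only from 0.3874 to 0.3872 when eccentricity is removed, which is consistent with treating the smaller model as adequate.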

Evaluating the Model

Residuals vs. Fitted Values

gg_resfitted(model2) +
  geom_point(alpha = 0.4, na.rm = TRUE) + 
  geom_smooth(se = FALSE, linewidth = 1, method = "lm", color = "red", na.rm = TRUE) +
  coord_cartesian(xlim = c(0, 5000), ylim = c(-5000, 5000)) + 
  labs(title = "Residuals vs Fitted: NASA Exoplanet Model") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The residuals vs. fitted plot shows a triangular shape, where the spread of residuals increases as fitted values increase. This indicates heteroscedasticity, meaning the model’s errors are not constant across all levels of predicted distance. Specifically, the model appears to be less accurate for larger predicted values, suggesting that it performs better for nearby planetary systems than for distant ones.
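The visual impression of a widening spread can be backed by a formal check. The sketch below computes Koenker's studentized form of the Breusch-Pagan statistic by hand in base R, using the fitted values as the auxiliary regressor. It runs on simulated data whose error spread grows with the predictor, and all object names are illustrative. The same steps could be applied to `model2`.

```r
set.seed(7)
n <- 500

# Simulated data whose error standard deviation grows with x (illustrative only)
x <- runif(n, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on fitted values
e2 <- residuals(fit)^2
fv <- fitted(fit)
aux <- lm(e2 ~ fv)

# Test statistic n * R^2, compared to chi-squared with 1 degree of freedom
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_p)
```

A small p-value points to non-constant variance, in which case a log-transformed response or robust standard errors are common next steps.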

Histogram

gg_reshist(model2)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Q-Q Plot

gg_qqplot(model2)

## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggplot2 package.
##   Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The histogram of residuals indicates that the distribution is not normal, as it exhibits a pronounced right tail (positive skew). This is further supported by the Normal Q-Q plot, where the residuals deviate from the reference line and curve sharply upward in the upper tail. This pattern means the model produces a number of very large positive residuals (the largest is about 23,320), so it substantially underpredicts distance for some systems. As a result, the normality assumption of the residuals is violated.
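The right skew described here can also be summarized with a single number. The sketch below computes the moment-based sample skewness of a model's residuals in base R, on simulated data with deliberately right-skewed errors. All object names are illustrative. A descriptive measure is used because shapiro.test() accepts at most 5000 values, fewer than the 5077 residuals in this model.

```r
set.seed(11)
n <- 1000

# Simulated data with right-skewed (centered exponential) errors, illustrative only
x <- runif(n, min = 0, max = 10)
y <- 1 + 2 * x + (rexp(n, rate = 1) - 1)
fit <- lm(y ~ x)

# Moment-based sample skewness: third central moment over variance^(3/2)
r <- residuals(fit)
centered <- r - mean(r)
skewness <- mean(centered^3) / mean(centered^2)^1.5
skewness
```

Values near zero suggest a symmetric distribution, while clearly positive values match the long right tail seen in the histogram. Applying the same calculation to residuals(model2) would quantify the skew in the actual fit.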

Cook's Distance

gg_cooksd(model2, threshold = "matlab")

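# Keep only the rows lm() used for model2, so that row positions here match
# the observation numbers shown in the Cook's distance plot
nasa_clean <- nasa_data %>%
  filter(
    !is.na(distance),
    !is.na(discovery_year),
    !is.na(stellar_magnitude)
  )

ggplot(data = slice(nasa_clean, c(4529, 4574, 4618, 4619))) +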
  geom_point(data = nasa_clean, 
             aes(x = stellar_magnitude, y = discovery_year)) +
  geom_point(aes(x = stellar_magnitude, y = discovery_year),
             color = "darkred") +
  geom_text_repel(aes(x = stellar_magnitude, y = discovery_year,
                      label = name),
                  color = "darkred") +
  labs(title = "Investigating High Influence Points",
       subtitle = "Label = name") +
  theme_minimal()
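The row numbers passed to slice() above are hard-coded from the Cook's distance plot. The sketch below shows how such indices could be derived with cooks.distance() in base R. It uses simulated data with one planted influential point and one missing response, and all object names are illustrative. The cutoff of three times the mean Cook's distance is an assumption about what the "matlab" threshold in gg_cooksd() means and should be checked against the lindia documentation.

```r
set.seed(1)
n <- 200

# Simulated data (illustrative only)
sim <- data.frame(x = runif(n, min = 0, max = 10))
sim$y <- 1 + 2 * sim$x + rnorm(n)
sim$y[5] <- NA        # a missing response, which lm() drops
sim$x[n] <- 50        # planted point: extreme x ...
sim$y[n] <- -100      # ... and far from the trend

fit <- lm(y ~ x, data = sim)

# One Cook's distance per row actually used in the fit
cd <- cooks.distance(fit)

# Assumed cutoff: three times the mean Cook's distance
cutoff <- 3 * mean(cd)
flagged <- cd[cd > cutoff]

# Names carry the original row numbers of the data frame,
# even though rows with missing values were dropped
original_rows <- as.integer(names(flagged))
original_rows
```

Looking rows up by their original row number, rather than by position in a separately filtered table, avoids any mismatch when rows with missing values have been removed.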


2026-03-30