Data Dive Nine

Regression Diagnostics

Load library

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lindia)
## Warning: package 'lindia' was built under R version 4.5.3
library(broom)
## Warning: package 'broom' was built under R version 4.5.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.5.2
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.5.2
library(effsize)
## Warning: package 'effsize' was built under R version 4.5.3
library(GGally)
## Warning: package 'GGally' was built under R version 4.5.3
library(dplyr)
library(ggplot2)

Load NASA data

nasa_data <- read_delim("C:/Users/imaya/Downloads/cleaned_5250.csv",delim = ",")
## Rows: 5250 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, planet_type, mass_wrt, radius_wrt, detection_method
## dbl (8): distance, stellar_magnitude, discovery_year, mass_multiplier, radiu...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nasa_data)
## # A tibble: 6 × 13
##   name     distance stellar_magnitude planet_type discovery_year mass_multiplier
##   <chr>       <dbl>             <dbl> <chr>                <dbl>           <dbl>
## 1 11 Coma…      304              4.72 Gas Giant             2007           19.4 
## 2 11 Ursa…      409              5.01 Gas Giant             2009           14.7 
## 3 14 Andr…      246              5.23 Gas Giant             2008            4.8 
## 4 14 Herc…       58              6.62 Gas Giant             2002            8.14
## 5 16 Cygn…       69              6.22 Gas Giant             1996            1.78
## 6 17 Scor…      408              5.23 Gas Giant             2020            4.32
## # ℹ 7 more variables: mass_wrt <chr>, radius_multiplier <dbl>,
## #   radius_wrt <chr>, orbital_radius <dbl>, orbital_period <dbl>,
## #   eccentricity <dbl>, detection_method <chr>

Notes: For this analysis we will examine the variables mass_multiplier and mass_wrt. The mass_multiplier represents the numerical value of the planet’s mass, while mass_wrt indicates the unit the mass is measured relative to (either Jupiter or Earth). The planet’s total mass is therefore interpreted as the multiplier times the reference unit.

To make comparisons easier across planets, all masses were standardized to Jupiter masses. This was done because some planets in the data set are measured relative to Earth’s mass, while others are measured relative to Jupiter’s mass. According to standard astronomical conversions, 1 Jupiter mass is about 317.77 Earth masses. Therefore, when a planet’s mass is given relative to Earth, it can be converted to Jupiter masses by dividing by 317.77

Source: https://www.unitsconverters.com/en/Jupitermass-To-Massofearth/Unittounit-6003-173

nasa_data$mass_jupiter <- ifelse(nasa_data$mass_wrt=="Jupiter",
                           nasa_data$mass_multiplier,
                           nasa_data$mass_multiplier / 317.8)

Linear Regression Model

For week 8, I built a linear regression model to estimate how planetary mass (in Jupiter units) is associated with orbital radius. The original model and summary are shown below.

model <- lm(mass_jupiter ~ orbital_radius, data = nasa_data)
model$coefficients
##    (Intercept) orbital_radius 
##    1.468470125    0.003873673
summary(model)
## 
## Call:
## lm(formula = mass_jupiter ~ orbital_radius, data = nasa_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.24  -1.46  -1.44  -0.94 750.33 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.468470   0.172875   8.494  < 2e-16 ***
## orbital_radius 0.003874   0.001243   3.117  0.00184 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.14 on 4941 degrees of freedom
##   (307 observations deleted due to missingness)
## Multiple R-squared:  0.001962,   Adjusted R-squared:  0.00176 
## F-statistic: 9.713 on 1 and 4941 DF,  p-value: 0.00184

Adding Eccentricity

The next variable I added to the model is eccentricity. Eccentricity measures how elliptical a planet’s orbit is, with lower values indicating more circular orbits and higher values indicating more elongated orbits. The coefficient for eccentricity is about 7.5, which means that, holding orbital radius constant, planets with more elongated orbits tend to have higher mass on average. The model also suggests that planets farther from their stars tend to be slightly more massive, although this effect is small.

However, these relationships are weak, and the model does not explain much of the variation in planet mass. Including eccentricity improved the model’s R-squared value from 0.002 to 0.0099, indicating a slight improvement in explanatory power. Despite this improvement, the overall R-squared remains low, suggesting that these variables explain only a small portion of the variation in planetary mass. Therefore, eccentricity is retained in the model, but additional variables may be needed to better explain planetary mass.

model_2 <- lm(mass_jupiter ~ orbital_radius + eccentricity, data = nasa_data)

summary(model_2)
## 
## Call:
## lm(formula = mass_jupiter ~ orbital_radius + eccentricity, data = nasa_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.28  -0.97  -0.96  -0.87 750.82 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.973469   0.189368   5.141 2.84e-07 ***
## orbital_radius 0.003944   0.001238   3.185  0.00146 ** 
## eccentricity   7.511883   1.195446   6.284 3.59e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.09 on 4940 degrees of freedom
##   (307 observations deleted due to missingness)
## Multiple R-squared:  0.009876,   Adjusted R-squared:  0.009475 
## F-statistic: 24.64 on 2 and 4940 DF,  p-value: 2.256e-11

Adding orbital_period

The next variable I tested in the model is orbital period, which measures the time a planet takes to complete one orbit around its star. While orbital period is significant in the regression (coefficient ≈ -0.00015, p < 0.001), it is highly correlated with orbital radius, which introduces multicollinearity. Including orbital period slightly changes the coefficients of other variables and does not meaningfully improve model fit (R-squared increases only from 0.0099 to 0.014). Due to its high correlation with orbital radius and minimal contribution to explaining variation in planetary mass, orbital period is not retained in the final model.

model_2 <- lm(mass_jupiter ~ orbital_radius + eccentricity + orbital_period, data = nasa_data)

summary(model_2)
## 
## Call:
## lm(formula = mass_jupiter ~ orbital_radius + eccentricity + orbital_period, 
##     data = nasa_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.66  -0.93  -0.91  -0.84 749.93 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.9289788  0.1891877   4.910 9.38e-07 ***
## orbital_radius  0.0223307  0.0040634   5.496 4.09e-08 ***
## eccentricity    7.4347800  1.1929558   6.232 4.98e-10 ***
## orbital_period -0.0001548  0.0000326  -4.750 2.09e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.07 on 4939 degrees of freedom
##   (307 observations deleted due to missingness)
## Multiple R-squared:  0.01438,    Adjusted R-squared:  0.01378 
## F-statistic: 24.02 on 3 and 4939 DF,  p-value: 2e-15

Adding Detection Method as a Binary Variable

The next variable I added to the model is the detection method, converted into a binary variable indicating whether the planet was discovered using the transit method (1) or another method (0). The transit method identifies planets when they pass in front of their stars, causing a periodic dip in observed light. In the regression, the coefficient for Transit is approximately -3.09, which means that, holding orbital radius and eccentricity constant, planets discovered via the transit method tend to have lower masses compared to planets detected by other methods. Including this variable slightly improved the model’s explanatory power, increasing R-squared from 0.0099 to 0.0193.

top_method <- nasa_data %>% count(detection_method, sort =TRUE) %>% slice_head(n=1)
print(top_method)
## # A tibble: 1 × 2
##   detection_method     n
##   <chr>            <int>
## 1 Transit           3945
nasa_data$Transit <- ifelse(nasa_data$detection_method == "Transit", 1, 0)
model_3 <- lm(mass_jupiter ~ orbital_radius + eccentricity + Transit,data = nasa_data)

summary(model_3)
## 
## Call:
## lm(formula = mass_jupiter ~ orbital_radius + eccentricity + Transit, 
##     data = nasa_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20.41  -0.56  -0.48  -0.45 748.25 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.582843   0.423825   8.454  < 2e-16 ***
## orbital_radius  0.003082   0.001239   2.488   0.0129 *  
## eccentricity    3.078161   1.353465   2.274   0.0230 *  
## Transit        -3.092191   0.449844  -6.874 7.02e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.04 on 4939 degrees of freedom
##   (307 observations deleted due to missingness)
## Multiple R-squared:  0.01926,    Adjusted R-squared:  0.01866 
## F-statistic: 32.33 on 3 and 4939 DF,  p-value: < 2.2e-16

Evaluating the Model

Residuals vs. Fitted Values

gg_resfitted(model_3) +
  geom_point(na.rm = TRUE) +
  geom_smooth(se = FALSE, linewidth = 1, method = "lm", na.rm = TRUE) +
  coord_cartesian(xlim = c(0, 20), ylim = c(-250, 250)) +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the lindia package.
##   Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'

The Residual vs. Fitted Value plot confirms a violation of the constant variance assumption. There is a distinct increase in the variance of the residuals at low fitted values.

Residuals vs. X Values

plots <-gg_resX(model_3, plot.all = FALSE)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the lindia package.
##   Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plots$orbital_radius + geom_smooth(se = FALSE, linewidth = 1, method = "lm", na.rm = TRUE) +
  coord_cartesian(xlim = c(0, 800), ylim = c(-150, 150)) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The plot violates the assumption of constant variance because the residuals are much larger and more volatile at smaller orbital radius than they are at larger distances, indicating heteroscedasticity, meaning that the values are not constant

Correlation Heatmap

ggcorr(select(nasa_data, mass_jupiter, orbital_radius, eccentricity, orbital_period), label = TRUE) + labs(title = 'Correlation Heatmap') + theme_minimal()

The correlation heatmap reveals a perfect positive relationship 1.0 between orbital_radius and orbital_period. This is expected since both variables essentially measure the scale of a planet’s orbit. To avoid issues with multicollinearity, one of these variables should be removed from the model. As they are redundant, including both makes it impossible for the model to distinguish their individual effects, leading to unstable and unreliable results.

QQ- Plots

gg_qqplot
## function (fitted.lm, scale.factor = 1) 
## {
##     handle_exception(fitted.lm, "gg_qqplot")
##     res = residuals(fitted.lm)
##     slope = (quantile(res, 0.75) - quantile(res, 0.25))/(qnorm(0.75) - 
##         qnorm(0.25))
##     intercept = quantile(res, 0.25) - slope * qnorm(0.25)
##     qq_line = data.frame(intercept = intercept, slope = slope)
##     qq_plot <- ggplot(data = fitted.lm) + stat_qq(aes(sample = res), 
##         size = scale.factor) + labs(x = "Theoretical Quantile", 
##         y = "Standardized Residual") + geom_abline(data = qq_line, 
##         aes(intercept = intercept, slope = slope), color = "indianred3", 
##         size = scale.factor) + ggtitle("Normal-QQ Plot")
##     qq_plot
## }
## <bytecode: 0x0000027354e74d58>
## <environment: namespace:lindia>
(model_3)
## 
## Call:
## lm(formula = mass_jupiter ~ orbital_radius + eccentricity + Transit, 
##     data = nasa_data)
## 
## Coefficients:
##    (Intercept)  orbital_radius    eccentricity         Transit  
##       3.582843        0.003082        3.078161       -3.092191

The Normal Q-Q Plot indicates that the data violates the normality assumption. While the middle of the data follows the theoretical line, there is a sharply curved right tail. This suggests that there are extreme outliers, like the Gas Giants that were identified using Cook’s D, that have much higher residuals than a normal distribution would predict.

Cook’s D By Observation

 gg_cooksd(model_3, threshold ='matlab')

In the cook’s D plot, each number represents a row of data. Rows 84 and 436 are high-valued points that need to be investigated.

nasa_clean <- nasa_data %>%
  filter(
    !is.na(orbital_radius),
    !is.na(mass_jupiter),
    !is.na(planet_type)
  )


ggplot(data = slice(nasa_clean, c(84, 436))) +
  geom_point(data = nasa_clean, 
             aes(x = orbital_radius, y = mass_jupiter)) +
  geom_point(aes(x = orbital_radius, y = mass_jupiter),
             color = "darkred") +
  geom_text_repel(aes(x = orbital_radius, y = mass_jupiter,
                      label = name),
                  color = "darkred") +
  labs(title = "Investigating High Influence Points",
       subtitle = "Label = name") +
  theme_minimal()

Most planets in the NASA data set are small and close to their stars; these two Gas Giants stand out because they are extremes:

  • Coconuts-2b: This planet is an outlier because of its Mass. It sits way at the top of the graph, showing it has much more weight (6.3x Jupiter) than other planets at a similar distance.

  • HD 100546 b: This planet is an outlier because of its Distance. It sits far to the right, showing an orbital radius (53 AU) that is much larger than the average planet.

The plot violates the I.I.D. assumption (Independent and Identically Distributed) because these two Gas Giants act as extreme outliers in mass and radius; this creates a fluctuation in variance across the graph.