library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lindia)
## Warning: package 'lindia' was built under R version 4.5.3
library(broom)
## Warning: package 'broom' was built under R version 4.5.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.5.2
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.5.2
library(effsize)
## Warning: package 'effsize' was built under R version 4.5.3
library(GGally)
## Warning: package 'GGally' was built under R version 4.5.3
library(dplyr)
library(ggplot2)
nasa_data <- read_delim("C:/Users/imaya/Downloads/cleaned_5250.csv",delim = ",")
## Rows: 5250 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, planet_type, mass_wrt, radius_wrt, detection_method
## dbl (8): distance, stellar_magnitude, discovery_year, mass_multiplier, radiu...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nasa_data)
## # A tibble: 6 × 13
## name distance stellar_magnitude planet_type discovery_year mass_multiplier
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 11 Coma… 304 4.72 Gas Giant 2007 19.4
## 2 11 Ursa… 409 5.01 Gas Giant 2009 14.7
## 3 14 Andr… 246 5.23 Gas Giant 2008 4.8
## 4 14 Herc… 58 6.62 Gas Giant 2002 8.14
## 5 16 Cygn… 69 6.22 Gas Giant 1996 1.78
## 6 17 Scor… 408 5.23 Gas Giant 2020 4.32
## # ℹ 7 more variables: mass_wrt <chr>, radius_multiplier <dbl>,
## # radius_wrt <chr>, orbital_radius <dbl>, orbital_period <dbl>,
## # eccentricity <dbl>, detection_method <chr>
Notes: For this analysis we will examine the variables mass_multiplier and mass_wrt. The mass_multiplier represents the numerical value of the planet’s mass, while mass_wrt indicates the unit the mass is measured relative to (either Jupiter or Earth). The planet’s total mass is therefore interpreted as the multiplier times the reference unit.
To make comparisons easier across planets, all masses were standardized to Jupiter masses. This was done because some planets in the data set are measured relative to Earth’s mass, while others are measured relative to Jupiter’s mass. According to standard astronomical conversions, 1 Jupiter mass is about 317.77 Earth masses. Therefore, when a planet’s mass is given relative to Earth, it can be converted to Jupiter masses by dividing by 317.77
Source: https://www.unitsconverters.com/en/Jupitermass-To-Massofearth/Unittounit-6003-173
nasa_data$mass_jupiter <- ifelse(nasa_data$mass_wrt=="Jupiter",
nasa_data$mass_multiplier,
nasa_data$mass_multiplier / 317.8)
For week 8, I built a linear regression model to estimate how planetary mass (in Jupiter units) is associated with orbital radius. The original model and summary are shown below.
model <- lm(mass_jupiter ~ orbital_radius, data = nasa_data)
model$coefficients
## (Intercept) orbital_radius
## 1.468470125 0.003873673
summary(model)
##
## Call:
## lm(formula = mass_jupiter ~ orbital_radius, data = nasa_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.24 -1.46 -1.44 -0.94 750.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.468470 0.172875 8.494 < 2e-16 ***
## orbital_radius 0.003874 0.001243 3.117 0.00184 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.14 on 4941 degrees of freedom
## (307 observations deleted due to missingness)
## Multiple R-squared: 0.001962, Adjusted R-squared: 0.00176
## F-statistic: 9.713 on 1 and 4941 DF, p-value: 0.00184
The next variable I added to the model is eccentricity. Eccentricity measures how elliptical a planet’s orbit is, with lower values indicating more circular orbits and higher values indicating more elongated orbits. The coefficient for eccentricity is about 7.5, which means that, holding orbital radius constant, planets with more elongated orbits tend to have higher mass on average. The model also suggests that planets farther from their stars tend to be slightly more massive, although this effect is small.
However, these relationships are weak, and the model does not explain much of the variation in planet mass. Including eccentricity improved the model’s R-squared value from 0.002 to 0.0099, indicating a slight improvement in explanatory power. Despite this improvement, the overall R-squared remains low, suggesting that these variables explain only a small portion of the variation in planetary mass. Therefore, eccentricity is retained in the model, but additional variables may be needed to better explain planetary mass.
model_2 <- lm(mass_jupiter ~ orbital_radius + eccentricity, data = nasa_data)
summary(model_2)
##
## Call:
## lm(formula = mass_jupiter ~ orbital_radius + eccentricity, data = nasa_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.28 -0.97 -0.96 -0.87 750.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.973469 0.189368 5.141 2.84e-07 ***
## orbital_radius 0.003944 0.001238 3.185 0.00146 **
## eccentricity 7.511883 1.195446 6.284 3.59e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.09 on 4940 degrees of freedom
## (307 observations deleted due to missingness)
## Multiple R-squared: 0.009876, Adjusted R-squared: 0.009475
## F-statistic: 24.64 on 2 and 4940 DF, p-value: 2.256e-11
The next variable I tested in the model is orbital period, which measures the time a planet takes to complete one orbit around its star. While orbital period is significant in the regression (coefficient ≈ -0.00015, p < 0.001), it is highly correlated with orbital radius, which introduces multicollinearity. Including orbital period slightly changes the coefficients of other variables and does not meaningfully improve model fit (R-squared increases only from 0.0099 to 0.014). Due to its high correlation with orbital radius and minimal contribution to explaining variation in planetary mass, orbital period is not retained in the final model.
model_2 <- lm(mass_jupiter ~ orbital_radius + eccentricity + orbital_period, data = nasa_data)
summary(model_2)
##
## Call:
## lm(formula = mass_jupiter ~ orbital_radius + eccentricity + orbital_period,
## data = nasa_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.66 -0.93 -0.91 -0.84 749.93
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9289788 0.1891877 4.910 9.38e-07 ***
## orbital_radius 0.0223307 0.0040634 5.496 4.09e-08 ***
## eccentricity 7.4347800 1.1929558 6.232 4.98e-10 ***
## orbital_period -0.0001548 0.0000326 -4.750 2.09e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.07 on 4939 degrees of freedom
## (307 observations deleted due to missingness)
## Multiple R-squared: 0.01438, Adjusted R-squared: 0.01378
## F-statistic: 24.02 on 3 and 4939 DF, p-value: 2e-15
The next variable I added to the model is the detection method, converted into a binary variable indicating whether the planet was discovered using the transit method (1) or another method (0). The transit method identifies planets when they pass in front of their stars, causing a periodic dip in observed light. In the regression, the coefficient for Transit is approximately -3.09, which means that, holding orbital radius and eccentricity constant, planets discovered via the transit method tend to have lower masses compared to planets detected by other methods. Including this variable slightly improved the model’s explanatory power, increasing R-squared from 0.0099 to 0.0193.
top_method <- nasa_data %>% count(detection_method, sort =TRUE) %>% slice_head(n=1)
print(top_method)
## # A tibble: 1 × 2
## detection_method n
## <chr> <int>
## 1 Transit 3945
nasa_data$Transit <- ifelse(nasa_data$detection_method == "Transit", 1, 0)
model_3 <- lm(mass_jupiter ~ orbital_radius + eccentricity + Transit,data = nasa_data)
summary(model_3)
##
## Call:
## lm(formula = mass_jupiter ~ orbital_radius + eccentricity + Transit,
## data = nasa_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.41 -0.56 -0.48 -0.45 748.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.582843 0.423825 8.454 < 2e-16 ***
## orbital_radius 0.003082 0.001239 2.488 0.0129 *
## eccentricity 3.078161 1.353465 2.274 0.0230 *
## Transit -3.092191 0.449844 -6.874 7.02e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.04 on 4939 degrees of freedom
## (307 observations deleted due to missingness)
## Multiple R-squared: 0.01926, Adjusted R-squared: 0.01866
## F-statistic: 32.33 on 3 and 4939 DF, p-value: < 2.2e-16
gg_resfitted(model_3) +
geom_point(na.rm = TRUE) +
geom_smooth(se = FALSE, linewidth = 1, method = "lm", na.rm = TRUE) +
coord_cartesian(xlim = c(0, 20), ylim = c(-250, 250)) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the lindia package.
## Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
The Residual vs. Fitted Value plot confirms a violation of the constant variance assumption. There is a distinct increase in the variance of the residuals at low fitted values.
plots <-gg_resX(model_3, plot.all = FALSE)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the lindia package.
## Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plots$orbital_radius + geom_smooth(se = FALSE, linewidth = 1, method = "lm", na.rm = TRUE) +
coord_cartesian(xlim = c(0, 800), ylim = c(-150, 150)) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The plot violates the assumption of constant variance because the residuals are much larger and more volatile at smaller orbital radius than they are at larger distances, indicating heteroscedasticity, meaning that the values are not constant
ggcorr(select(nasa_data, mass_jupiter, orbital_radius, eccentricity, orbital_period), label = TRUE) + labs(title = 'Correlation Heatmap') + theme_minimal()
The correlation heatmap reveals a perfect positive relationship 1.0 between orbital_radius and orbital_period. This is expected since both variables essentially measure the scale of a planet’s orbit. To avoid issues with multicollinearity, one of these variables should be removed from the model. As they are redundant, including both makes it impossible for the model to distinguish their individual effects, leading to unstable and unreliable results.
gg_qqplot
## function (fitted.lm, scale.factor = 1)
## {
## handle_exception(fitted.lm, "gg_qqplot")
## res = residuals(fitted.lm)
## slope = (quantile(res, 0.75) - quantile(res, 0.25))/(qnorm(0.75) -
## qnorm(0.25))
## intercept = quantile(res, 0.25) - slope * qnorm(0.25)
## qq_line = data.frame(intercept = intercept, slope = slope)
## qq_plot <- ggplot(data = fitted.lm) + stat_qq(aes(sample = res),
## size = scale.factor) + labs(x = "Theoretical Quantile",
## y = "Standardized Residual") + geom_abline(data = qq_line,
## aes(intercept = intercept, slope = slope), color = "indianred3",
## size = scale.factor) + ggtitle("Normal-QQ Plot")
## qq_plot
## }
## <bytecode: 0x0000027354e74d58>
## <environment: namespace:lindia>
(model_3)
##
## Call:
## lm(formula = mass_jupiter ~ orbital_radius + eccentricity + Transit,
## data = nasa_data)
##
## Coefficients:
## (Intercept) orbital_radius eccentricity Transit
## 3.582843 0.003082 3.078161 -3.092191
The Normal Q-Q Plot indicates that the data violates the normality assumption. While the middle of the data follows the theoretical line, there is a sharply curved right tail. This suggests that there are extreme outliers, like the Gas Giants that were identified using Cook’s D, that have much higher residuals than a normal distribution would predict.
gg_cooksd(model_3, threshold ='matlab')
In the cook’s D plot, each number represents a row of data. Rows 84 and 436 are high-valued points that need to be investigated.
nasa_clean <- nasa_data %>%
filter(
!is.na(orbital_radius),
!is.na(mass_jupiter),
!is.na(planet_type)
)
ggplot(data = slice(nasa_clean, c(84, 436))) +
geom_point(data = nasa_clean,
aes(x = orbital_radius, y = mass_jupiter)) +
geom_point(aes(x = orbital_radius, y = mass_jupiter),
color = "darkred") +
geom_text_repel(aes(x = orbital_radius, y = mass_jupiter,
label = name),
color = "darkred") +
labs(title = "Investigating High Influence Points",
subtitle = "Label = name") +
theme_minimal()
Most planets in the NASA data set are small and close to their stars; these two Gas Giants stand out because they are extremes:
Coconuts-2b: This planet is an outlier because of its Mass. It sits way at the top of the graph, showing it has much more weight (6.3x Jupiter) than other planets at a similar distance.
HD 100546 b: This planet is an outlier because of its Distance. It sits far to the right, showing an orbital radius (53 AU) that is much larger than the average planet.
The plot violates the I.I.D. assumption (Independent and Identically Distributed) because these two Gas Giants act as extreme outliers in mass and radius; this creates a fluctuation in variance across the graph.