library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
nasa_data <- read_delim("C:/Users/imaya/Downloads/cleaned_5250.csv", delim = ",")
## Rows: 5250 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, planet_type, mass_wrt, radius_wrt, detection_method
## dbl (8): distance, stellar_magnitude, discovery_year, mass_multiplier, radiu...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nasa_data)
## # A tibble: 6 × 13
## name distance stellar_magnitude planet_type discovery_year mass_multiplier
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 11 Coma… 304 4.72 Gas Giant 2007 19.4
## 2 11 Ursa… 409 5.01 Gas Giant 2009 14.7
## 3 14 Andr… 246 5.23 Gas Giant 2008 4.8
## 4 14 Herc… 58 6.62 Gas Giant 2002 8.14
## 5 16 Cygn… 69 6.22 Gas Giant 1996 1.78
## 6 17 Scor… 408 5.23 Gas Giant 2020 4.32
## # ℹ 7 more variables: mass_wrt <chr>, radius_multiplier <dbl>,
## # radius_wrt <chr>, orbital_radius <dbl>, orbital_period <dbl>,
## # eccentricity <dbl>, detection_method <chr>
nasa_data %>%
  select(mass_multiplier, mass_wrt, radius_multiplier, radius_wrt)
## # A tibble: 5,250 × 4
## mass_multiplier mass_wrt radius_multiplier radius_wrt
## <dbl> <chr> <dbl> <chr>
## 1 19.4 Jupiter 1.08 Jupiter
## 2 14.7 Jupiter 1.09 Jupiter
## 3 4.8 Jupiter 1.15 Jupiter
## 4 8.14 Jupiter 1.12 Jupiter
## 5 1.78 Jupiter 1.2 Jupiter
## 6 4.32 Jupiter 1.15 Jupiter
## 7 10.3 Jupiter 1.11 Jupiter
## 8 8 Jupiter 1.66 Jupiter
## 9 0.91 Jupiter 1.24 Jupiter
## 10 1.99 Jupiter 1.19 Jupiter
## # ℹ 5,240 more rows
Without the documentation, it was unclear how the values in mass_multiplier and radius_multiplier were calculated or meant to be used. The names mass_wrt and radius_wrt were also confusing: "wrt" reads like an abbreviation of "weight", when it presumably stands for "with respect to", and the columns actually name the solar-system planet used as the reference unit. According to the documentation, each planet's mass is given relative to a planet in our solar system, so the multiplier shows how many times more massive the planet is than that reference planet.
I think the data was encoded this way to make planetary measurements more standardized and easier to compare. Using a reference planet like Jupiter avoids having to list extremely large numbers and makes comparisons more manageable. Without reading the documentation, I might have assumed that mass_multiplier alone represented the planet’s full mass, or misunderstood mass_wrt as weight rather than a reference unit. That could have led to incorrect conclusions about the size of the planets.
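One way to make the multiplier/reference encoding concrete is to convert every mass to a single unit. The sketch below is only illustrative: the helper `to_earth_masses()`, the lookup `ref_mass_earth`, and the column `mass_earth` are hypothetical names, the toy rows are made up, and it assumes that `mass_wrt` only takes the values "Jupiter" and "Earth" and that Jupiter is roughly 317.8 Earth masses.

```r
library(dplyr)

# Approximate mass of each reference planet in Earth masses (assumed values)
ref_mass_earth <- c(Earth = 1, Jupiter = 317.8)

# Hypothetical helper: combine the multiplier with its reference planet.
# Reference planets missing from the lookup (or NA) come back as NA.
to_earth_masses <- function(multiplier, wrt) {
  unname(multiplier * ref_mass_earth[wrt])
}

# Toy rows (made-up values) that mimic the encoding used in nasa_data
toy <- tibble(
  mass_multiplier = c(19.4, 2.5, NA),
  mass_wrt        = c("Jupiter", "Earth", NA)
)

toy %>% mutate(mass_earth = to_earth_masses(mass_multiplier, mass_wrt))
```

On the real data the same call would be `nasa_data %>% mutate(mass_earth = to_earth_masses(mass_multiplier, mass_wrt))`. Reading `mass_multiplier` on its own, without this step, is exactly the mistake described above.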
After reading the documentation, one thing that is still unclear is why the rounding is inconsistent within some columns. For example, some mass_multiplier values are reported with several decimal places (such as 8.13881), while others in the same column are rounded to just one or two. The same inconsistency appears in other columns. It could affect calculations or comparisons between planets, especially where precise ratios matter, and the documentation does not explain why some values are rounded differently from others.
# Count decimal places in mass_multiplier; whole numbers (no ".") count as 0
nasa_data <- nasa_data %>%
  mutate(cdp_mm = ifelse(is.na(mass_multiplier), NA_real_,
                         nchar(sub("^[^.]*\\.?", "", as.character(mass_multiplier)))))
vis1 <- ggplot(nasa_data[is.finite(nasa_data$cdp_mm), ], aes(x = cdp_mm)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Decimal Places in mass_multiplier",
       x = "Number of Decimal Places",
       y = "Count of Planets")
vis1
# y is the row index within the filtered data, not a count
vis2 <- ggplot(nasa_data[is.finite(nasa_data$cdp_mm), ],
               aes(x = factor(cdp_mm), y = seq_along(mass_multiplier))) +
  geom_point(color = "pink", size = 2, position = position_jitter(width = 0.2)) +
  labs(title = "Scatterplot of Planets by Decimal Places in mass_multiplier",
       x = "Number of Decimal Places",
       y = "Row Index") +
  theme_minimal()
vis2
# Label a single value as "NA", "Empty", or "Unknown"; return NA if it is present
cmt <- function(x) {
  if (is.na(x)) {
    return("NA")
  } else if (x == "") {
    return("Empty")
  } else if (tolower(as.character(x)) == "unknown") {
    return("Unknown")
  } else {
    return(NA)
  }
}
colc <- c("planet_type", "detection_method")
m_info <- nasa_data %>%
  mutate(row_number = row_number()) %>%
  rowwise() %>%
  mutate(mpt = cmt(planet_type), mdm = cmt(detection_method)) %>%
  ungroup() %>%
  filter(!is.na(mpt) | !is.na(mdm)) %>%
  select(row_number, mpt, mdm)
m_info
## # A tibble: 5 × 3
## row_number mpt mdm
## <int> <chr> <lgl>
## 1 4475 Unknown NA
## 2 4476 Unknown NA
## 3 4477 Unknown NA
## 4 4575 Unknown NA
## 5 4576 Unknown NA
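The table lists the rows flagged by cmt(): five rows (4475 to 4477, 4575, and 4576) have planet_type (mpt) recorded as "Unknown". This is an implicitly missing value, because the cell is filled in but carries no information. In the detection_method (mdm) column, NA means that cmt() found nothing to flag, so detection_method is present in those rows. Neither column contains an explicitly missing value (NA) or an empty string.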
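To treat both kinds of missingness the same way in later steps, the implicit "Unknown" label can be converted into an explicit NA. This is a minimal sketch on made-up toy rows, assuming "Unknown" is the only placeholder used; `toy` and `toy_clean` are hypothetical names.

```r
library(dplyr)

# Toy rows (made-up values): one implicit ("Unknown") and one explicit (NA) gap
toy <- tibble(planet_type = c("Gas Giant", "Unknown", NA, "Terrestrial"))

# na_if() replaces values equal to "Unknown" with NA
toy_clean <- toy %>%
  mutate(planet_type = na_if(planet_type, "Unknown"))

# Both gaps are now counted by is.na()
sum(is.na(toy_clean$planet_type))
```

After this step, functions that understand NA (such as `is.na()` or `tidyr::drop_na()`) pick up the "Unknown" rows without any special handling.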
For the planets with no planet type listed, related columns such as mass_multiplier, mass_wrt, radius_multiplier, and radius_wrt are also missing. These rows can be treated as an empty group, since the data for those planets is not documented. I looked up one of these planets in the NASA Exoplanet Archive and found that it was detected by imaging, but the rest of its information is still unknown.
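The link between an unknown planet type and missing mass or radius values could be checked directly by counting NAs within the "Unknown" rows. The following is a sketch under assumptions: `count_na_for_unknown()` is a hypothetical helper and the toy rows are made up to mimic the structure of nasa_data.

```r
library(dplyr)

# Hypothetical helper: for rows where planet_type is "Unknown",
# count how many values are NA in each of the chosen columns
count_na_for_unknown <- function(data, cols) {
  data %>%
    filter(tolower(planet_type) == "unknown") %>%
    summarise(n_rows = n(), across(all_of(cols), ~ sum(is.na(.x))))
}

# Toy rows (made-up values)
toy <- tibble(
  planet_type       = c("Gas Giant", "Unknown", "Unknown"),
  mass_multiplier   = c(19.4, NA, NA),
  mass_wrt          = c("Jupiter", NA, NA),
  radius_multiplier = c(1.08, NA, 1.2)
)

count_na_for_unknown(toy, c("mass_multiplier", "mass_wrt", "radius_multiplier"))
```

If a column's NA count equals `n_rows`, every "Unknown" planet is missing that value. Passing `nasa_data` with the four mass and radius columns would run the same check on the real file.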
iqr_orb <- IQR(nasa_data$orbital_radius, na.rm = TRUE)
q1 <- quantile(nasa_data$orbital_radius, 0.25, na.rm = TRUE)
q3 <- quantile(nasa_data$orbital_radius, 0.75, na.rm = TRUE)
lower_bound <- q1 - (1.5 * iqr_orb)
upper_bound <- q3 + (1.5 * iqr_orb)
outliers <- nasa_data |>
  filter(orbital_radius < lower_bound | orbital_radius > upper_bound)
print(nrow(outliers))
## [1] 870
outliers
## # A tibble: 870 × 14
## name distance stellar_magnitude planet_type discovery_year mass_multiplier
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 11 Com… 304 4.72 Gas Giant 2007 19.4
## 2 11 Urs… 409 5.01 Gas Giant 2009 14.7
## 3 14 And… 246 5.23 Gas Giant 2008 4.8
## 4 14 Her… 58 6.62 Gas Giant 2002 8.14
## 5 16 Cyg… 69 6.22 Gas Giant 1996 1.78
## 6 17 Sco… 408 5.23 Gas Giant 2020 4.32
## 7 18 Del… 249 5.51 Gas Giant 2008 10.3
## 8 1RXS J… 454 12.6 Gas Giant 2008 8
## 9 24 Sex… 235 6.45 Gas Giant 2010 1.99
## 10 24 Sex… 235 6.45 Gas Giant 2010 0.86
## # ℹ 860 more rows
## # ℹ 8 more variables: mass_wrt <chr>, radius_multiplier <dbl>,
## # radius_wrt <chr>, orbital_radius <dbl>, orbital_period <dbl>,
## # eccentricity <dbl>, detection_method <chr>, cdp_mm <dbl>
The IQR method flags 870 outliers, which seems very high for a dataset of 5,250 planets (about 17%). This is probably because orbital radius (in AU) has a heavily right-skewed distribution: most planets orbit close to their star, so the IQR is small and even moderately large radii fall outside the fences. Scaling or log-transforming the variable before applying the IQR rule may better identify the truly extreme values.
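The log-transform idea can be sketched as follows: apply the same 1.5 × IQR fences, but to log10 of the orbital radius. This is an illustration rather than a tested recommendation for this dataset. `flag_log_outliers()` is a hypothetical helper, the toy radii are made up, and the approach assumes all radii are positive.

```r
library(dplyr)

# Hypothetical helper: flag values outside the 1.5 * IQR fences,
# computed on the log10 scale (assumes x > 0)
flag_log_outliers <- function(x) {
  log_x <- log10(x)
  q <- quantile(log_x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  log_x < q[[1]] - fence | log_x > q[[2]] + fence
}

# Toy orbital radii in AU (made-up values): mostly small, one very large
toy <- tibble(orbital_radius = c(0.03, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 330))

toy %>% mutate(is_outlier = flag_log_outliers(orbital_radius))
```

On these toy values the raw-scale fences would flag both 5 and 330, while the log-scale fences flag only 330. On the real data, `sum(flag_log_outliers(nasa_data$orbital_radius), na.rm = TRUE)` would give a count to compare against the 870 above.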
I would define an outlier as a planet whose orbital radius is far larger than the typical range, such as the planet at 330 AU. The IQR method flags many much smaller values as well, but 330 AU is extreme compared with most planets, which generally have an orbital radius below 10 AU.
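A definition like this can be written as a simple threshold filter. In the sketch below, `extreme_orbits()` is a hypothetical helper, the 10 AU default is only the rough cutoff suggested above (not a value from the documentation), and the toy rows are made up.

```r
library(dplyr)

# Hypothetical helper: keep planets whose orbital radius exceeds a threshold (AU),
# largest first
extreme_orbits <- function(data, threshold = 10) {
  data %>%
    filter(orbital_radius > threshold) %>%
    arrange(desc(orbital_radius))
}

# Toy rows (made-up names and values)
toy <- tibble(
  name           = c("planet a", "planet b", "planet c", "planet d"),
  orbital_radius = c(0.05, 1.2, 330, 8)
)

extreme_orbits(toy)
```

Unlike the IQR fences, the threshold is chosen by hand, so it should be reported alongside any results that depend on it.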