Importing necessary libraries
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ lubridate 1.9.4 ✔ tibble 3.3.1
## ✔ purrr 1.2.1 ✔ tidyr 1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
nasa_data <- read_delim("C:/Users/imaya/Downloads/cleaned_5250.csv",delim = ",")
## Rows: 5250 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, planet_type, mass_wrt, radius_wrt, detection_method
## dbl (8): distance, stellar_magnitude, discovery_year, mass_multiplier, radiu...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nasa_data)
## # A tibble: 6 × 13
## name distance stellar_magnitude planet_type discovery_year mass_multiplier
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 11 Coma… 304 4.72 Gas Giant 2007 19.4
## 2 11 Ursa… 409 5.01 Gas Giant 2009 14.7
## 3 14 Andr… 246 5.23 Gas Giant 2008 4.8
## 4 14 Herc… 58 6.62 Gas Giant 2002 8.14
## 5 16 Cygn… 69 6.22 Gas Giant 1996 1.78
## 6 17 Scor… 408 5.23 Gas Giant 2020 4.32
## # ℹ 7 more variables: mass_wrt <chr>, radius_multiplier <dbl>,
## # radius_wrt <chr>, orbital_radius <dbl>, orbital_period <dbl>,
## # eccentricity <dbl>, detection_method <chr>
summary_type <- nasa_data |> count(planet_type, sort = TRUE)
summary_type
## # A tibble: 5 × 2
## planet_type n
## <chr> <int>
## 1 Neptune-like 1825
## 2 Gas Giant 1630
## 3 Super Earth 1595
## 4 Terrestrial 195
## 5 Unknown 5
unique_types <- unique(nasa_data$planet_type)
unique_types
## [1] "Gas Giant" "Super Earth" "Neptune-like" "Terrestrial" "Unknown"
The planet_type summary shows which planet types are discovered most often, such as Neptune-like planets (1,825) and Gas Giants (1,630), and which are discovered least often, such as Terrestrial planets (195). The column contains five unique values: “Gas Giant,” “Super Earth,” “Neptune-like,” “Terrestrial,” and “Unknown.”
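To compare the planet types on a common scale, the counts can be converted to percentages. The following is a minimal base R sketch that retypes the counts from the summary above instead of reading the CSV; the names `type_counts` and `type_share` are illustrative and not part of the original analysis.

```r
# Counts copied from the planet_type summary above (not re-read from the CSV)
type_counts <- c(
  "Neptune-like" = 1825,
  "Gas Giant"    = 1630,
  "Super Earth"  = 1595,
  "Terrestrial"  = 195,
  "Unknown"      = 5
)

# Percentage of all planets in each type, rounded to one decimal place
type_share <- round(100 * prop.table(type_counts), 1)
type_share
```

With the real data, the same idea could be applied to the `n` column of `summary_type`.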
# Summary stats for distance
summary_distance <- data.frame(
  statistics = c("Minimum", "Maximum", "Mean", "Median"),
  Value = c(
    min(nasa_data$distance, na.rm = TRUE),
    max(nasa_data$distance, na.rm = TRUE),
    mean(nasa_data$distance, na.rm = TRUE),
    median(nasa_data$distance, na.rm = TRUE)
  )
)
# Quantile stats for distance
quantile_distance <- data.frame(
  quantile = c("0%", "25%", "50%", "75%", "100%"),
  Value = quantile(nasa_data$distance, na.rm = TRUE)
)
# Print results
summary_distance
## statistics Value
## 1 Minimum 4.000
## 2 Maximum 27727.000
## 3 Mean 2167.169
## 4 Median 1371.000
quantile_distance
## quantile Value
## 0% 0% 4
## 25% 25% 389
## 50% 50% 1371
## 75% 75% 2779
## 100% 100% 27727
The summary statistics for exoplanet distance (in light-years) describe the range and distribution of discoveries. The closest exoplanet is 4 light-years away, while the farthest is 27,727 light-years away. The mean distance is 2,167 light-years and the median is 1,371 light-years; the mean exceeds the median because the distribution has a long tail of very distant planets. The quantiles show that 25% of exoplanets are within 389 light-years, 50% are within 1,371 light-years, and 75% are within 2,779 light-years. In the future, I might explore in more depth how many exoplanets lie within different distance ranges, such as within 100, 500, 1,000, or 5,000 light-years of Earth.
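The distance-range question raised above could be approached with cumulative counts at each threshold. The sketch below is one possible way to do it in base R. The `distance` vector is a small hypothetical stand-in for `nasa_data$distance`, so the counts it produces say nothing about the real dataset.

```r
# Hypothetical stand-in for nasa_data$distance (light-years); NA = missing distance
distance <- c(4, 58, 69, 246, 304, 408, 409, 1371, 2779, 27727, NA)

# Thresholds named in the text above
thresholds <- c(100, 500, 1000, 5000)

# For each threshold, count the planets at or within that distance
within_counts <- data.frame(
  threshold_ly = thresholds,
  n_within = sapply(thresholds, function(t) sum(distance <= t, na.rm = TRUE))
)

# Share of planets with a known distance that fall within each threshold
within_counts$share <- within_counts$n_within / sum(!is.na(distance))
within_counts
```

Replacing the toy vector with `nasa_data$distance` should give the real counts, with missing distances excluded from both the numerator and the denominator.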
# Summary statistics for discovery years
summary_year <- data.frame(
  Statistic = c("Minimum", "Maximum", "Mean", "Median", "Mode"),
  Value = c(
    min(nasa_data$discovery_year, na.rm = TRUE),
    max(nasa_data$discovery_year, na.rm = TRUE),
    mean(nasa_data$discovery_year, na.rm = TRUE),
    median(nasa_data$discovery_year, na.rm = TRUE),
    # Mode: the year with the highest count in the frequency table
    as.numeric(names(which.max(table(nasa_data$discovery_year))))
  )
)
# Quantiles for discovery years
quantile_year <- data.frame(
  Quantile = c("0%", "25%", "50%", "75%", "100%"),
  Value = quantile(nasa_data$discovery_year, na.rm = TRUE)
)
# Counts per year
counts_byyear <- as.data.frame(table(nasa_data$discovery_year))
colnames(counts_byyear) <- c("Year", "Count")
# Print results
summary_year
## Statistic Value
## 1 Minimum 1992.000
## 2 Maximum 2023.000
## 3 Mean 2015.732
## 4 Median 2016.000
## 5 Mode 2016.000
quantile_year
## Quantile Value
## 0% 0% 1992
## 25% 25% 2014
## 50% 50% 2016
## 75% 75% 2018
## 100% 100% 2023
counts_byyear
## Year Count
## 1 1992 2
## 2 1994 1
## 3 1995 1
## 4 1996 6
## 5 1997 1
## 6 1998 6
## 7 1999 13
## 8 2000 16
## 9 2001 12
## 10 2002 29
## 11 2003 22
## 12 2004 27
## 13 2005 36
## 14 2006 31
## 15 2007 52
## 16 2008 65
## 17 2009 94
## 18 2010 97
## 19 2011 138
## 20 2012 138
## 21 2013 126
## 22 2014 875
## 23 2015 157
## 24 2016 1517
## 25 2017 153
## 26 2018 326
## 27 2019 203
## 28 2020 234
## 29 2021 525
## 30 2022 338
## 31 2023 9
The summary statistics for the discovery year show that the exoplanets in this dataset were discovered between 1992 and 2023. The mean discovery year is 2015.7, the median is 2016, and 2016 is also the most common year for discoveries. The quantiles indicate that 25% of discoveries occurred by 2014, 50% by 2016, and 75% by 2018. The year-by-year counts show that discoveries are unevenly spread: 2014 (875) and 2016 (1,517) stand far above every other year. In the future, I might explore the data in more depth to understand why some years had spikes in discoveries while others had very few.
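One way to quantify how concentrated the discoveries are is to compute the share contributed by the busiest years. The sketch below retypes the yearly counts from the `counts_byyear` output above into a new data frame; the name `yearly` and the choice to look at the top two years are illustrative assumptions.

```r
# Yearly counts copied from the counts_byyear output above (1993 has no entry)
yearly <- data.frame(
  Year = c(1992, 1994:2023),
  Count = c(2, 1, 1, 6, 1, 6, 13,
            16, 12, 29, 22, 27, 36, 31, 52, 65, 94,
            97, 138, 138, 126, 875, 157, 1517, 153, 326, 203,
            234, 525, 338, 9)
)

# The two years with the most discoveries
top_years <- yearly[order(-yearly$Count), ][1:2, ]

# Share of all discoveries that fall in those two years
top_share <- sum(top_years$Count) / sum(yearly$Count)
top_years
top_share
```

Under these counts, the two busiest years together account for a little under half of all planets in the dataset.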
#Distance by year
distance_byyear <- nasa_data |>
  group_by(discovery_year) |>
  summarise(
    mean_distance = mean(distance, na.rm = TRUE),
    median_distance = median(distance, na.rm = TRUE)
  )
# Distance counts by decade
# ly = light-year
counts_bydecade <- nasa_data |>
  mutate(
    decade = floor(discovery_year / 10) * 10,
    distance_group = ifelse(distance > 1000, ">1000 ly", "≤1000 ly")
  ) |>
  group_by(decade, distance_group) |>
  summarise(
    n_planets = n(),
    .groups = "drop"
  ) |>
  pivot_wider(
    names_from = distance_group,
    values_from = n_planets,
    values_fill = 0
  )
# Print results
distance_byyear
## # A tibble: 31 × 3
## discovery_year mean_distance median_distance
## <dbl> <dbl> <dbl>
## 1 1992 1957 1957
## 2 1994 1957 1957
## 3 1995 50 50
## 4 1996 51.3 48
## 5 1997 57 57
## 6 1998 91.8 96
## 7 1999 97.1 95
## 8 2000 102. 106.
## 9 2001 130. 120
## 10 2002 171 125
## # ℹ 21 more rows
counts_bydecade
## # A tibble: 4 × 4
## decade `>1000 ly` `≤1000 ly` `NA`
## <dbl> <int> <int> <int>
## 1 1990 3 27 0
## 2 2000 45 338 1
## 3 2010 2558 1170 2
## 4 2020 486 606 14
The counts show that discoveries beyond 1,000 light-years grew over the first three decades, from 3 in the 1990s to 45 in the 2000s and 2,558 in the 2010s, when they outnumbered nearer discoveries (1,170) for the first time. The pattern does not hold in the 2020s, where 486 planets lie beyond 1,000 light-years and 606 lie at or within that distance. The 2020s figures cover only 2020 to 2023, and 2023 contributes just 9 planets, so it is too early to identify a trend for that decade. The NA column counts planets with no recorded distance (17 in total), which ifelse() leaves unassigned to either group.
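Because the decades differ so much in total discoveries, proportions may be easier to compare than raw counts. The sketch below retypes the `counts_bydecade` output above and computes the share of planets beyond 1,000 light-years among those with a known distance. The column names `n_far`, `n_near`, and `n_unknown` are illustrative, and excluding the unknown-distance planets from the denominator is an assumption about how they should be handled.

```r
# Counts copied from the counts_bydecade output above
decade_counts <- data.frame(
  decade    = c(1990, 2000, 2010, 2020),
  n_far     = c(3, 45, 2558, 486),    # > 1000 ly
  n_near    = c(27, 338, 1170, 606),  # <= 1000 ly
  n_unknown = c(0, 1, 2, 14)          # distance missing (the NA column)
)

# Share beyond 1,000 ly, using only planets with a known distance
decade_counts$share_far <- with(decade_counts, n_far / (n_far + n_near))
decade_counts$share_far <- round(decade_counts$share_far, 3)
decade_counts
```

Read this way, the share of distant planets rises through the 2010s and then falls back in the partial 2020s data.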
method_distance_counts <- nasa_data |>
  mutate(distance_group = ifelse(distance > 1000, ">1000 ly", "≤1000 ly")) |>
  group_by(detection_method, distance_group) |>
  summarise(
    n_planets = n(),
    .groups = "drop"
  ) |>
  pivot_wider(
    names_from = distance_group,
    values_from = n_planets,
    values_fill = 0
  )
method_distance_counts
## # A tibble: 11 × 4
## detection_method `≤1000 ly` `NA` `>1000 ly`
## <chr> <int> <int> <int>
## 1 Astrometry 2 0 0
## 2 Direct Imaging 60 2 0
## 3 Disk Kinematics 1 0 0
## 4 Eclipse Timing Variations 7 1 9
## 5 Gravitational Microlensing 0 2 152
## 6 Orbital Brightness Modulation 0 0 9
## 7 Pulsar Timing 1 1 5
## 8 Pulsation Timing Variations 0 0 2
## 9 Radial Velocity 977 0 50
## 10 Transit 1085 11 2849
## 11 Transit Timing Variations 8 0 16
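There is a clear pattern between detection method and distance. Planets located within 1,000 light-years are most commonly discovered using the Transit (1,085) or Radial Velocity (977) methods, while planets beyond 1,000 light-years are primarily discovered using the Transit method (2,849) or Gravitational Microlensing (152). Radial Velocity contributes only 50 planets beyond 1,000 light-years, and every Gravitational Microlensing planet with a recorded distance lies beyond that threshold.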
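The ranking described above can be checked by sorting the methods within each distance group. The sketch below retypes the `method_distance_counts` output (leaving out the NA column) and uses a small helper, `top_two()`, which is a hypothetical function written for this illustration and not part of the original analysis.

```r
# Counts copied from the method_distance_counts output above (NA column omitted)
method_counts <- data.frame(
  detection_method = c(
    "Astrometry", "Direct Imaging", "Disk Kinematics",
    "Eclipse Timing Variations", "Gravitational Microlensing",
    "Orbital Brightness Modulation", "Pulsar Timing",
    "Pulsation Timing Variations", "Radial Velocity", "Transit",
    "Transit Timing Variations"
  ),
  n_near = c(2, 60, 1, 7, 0, 0, 1, 0, 977, 1085, 8),   # <= 1000 ly
  n_far  = c(0, 0, 0, 9, 152, 9, 5, 2, 50, 2849, 16),  # > 1000 ly
  stringsAsFactors = FALSE
)

# Hypothetical helper: labels of the two largest counts, largest first
top_two <- function(counts, labels) {
  labels[order(counts, decreasing = TRUE)[1:2]]
}

top_near <- top_two(method_counts$n_near, method_counts$detection_method)
top_far  <- top_two(method_counts$n_far, method_counts$detection_method)
top_near
top_far
```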
#Discovery rate
discovery_rate <- nasa_data |>
  group_by(discovery_year) |>
  summarise(
    n_planets = n(),
    .groups = "drop"
  )
# Filter discovery_rate for different decades
decade_1990 <- discovery_rate |> filter(discovery_year < 2000)
decade_2000 <- discovery_rate |> filter(discovery_year >= 2000 & discovery_year < 2010)
decade_2010 <- discovery_rate |> filter(discovery_year >= 2010 & discovery_year < 2020)
decade_2020 <- discovery_rate |> filter(discovery_year >= 2020)
#Print Results
decade_1990
## # A tibble: 7 × 2
## discovery_year n_planets
## <dbl> <int>
## 1 1992 2
## 2 1994 1
## 3 1995 1
## 4 1996 6
## 5 1997 1
## 6 1998 6
## 7 1999 13
decade_2000
## # A tibble: 10 × 2
## discovery_year n_planets
## <dbl> <int>
## 1 2000 16
## 2 2001 12
## 3 2002 29
## 4 2003 22
## 5 2004 27
## 6 2005 36
## 7 2006 31
## 8 2007 52
## 9 2008 65
## 10 2009 94
decade_2010
## # A tibble: 10 × 2
## discovery_year n_planets
## <dbl> <int>
## 1 2010 97
## 2 2011 138
## 3 2012 138
## 4 2013 126
## 5 2014 875
## 6 2015 157
## 7 2016 1517
## 8 2017 153
## 9 2018 326
## 10 2019 203
decade_2020
## # A tibble: 4 × 2
## discovery_year n_planets
## <dbl> <int>
## 1 2020 234
## 2 2021 525
## 3 2022 338
## 4 2023 9
The number of exoplanet discoveries per year has grown from decade to decade, though not steadily. In the 1990s, most years had fewer than 10 discoveries, while in the 2000s a typical year had around 30. In the 2010s, most years saw between roughly 100 and 330 discoveries, with sharp spikes in 2014 (875) and 2016 (1,517). The 2020s data cover 2020 to 2023: the first three years each recorded more than 200 discoveries, while 2023 has only 9, which suggests the dataset ends partway through that year. This leads me to wonder how many more discoveries have been made in the 2020s than the data currently include.
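A per-decade median gives a single "typical year" figure that is not distorted by spike years such as 2014 and 2016. The sketch below retypes the yearly counts printed above and summarises them with base R's `aggregate()`; the names `yearly` and `decade_median` are illustrative, and the median is just one reasonable choice of summary.

```r
# Yearly counts copied from the decade tables above (1993 has no entry)
yearly <- data.frame(
  Year = c(1992, 1994:2023),
  Count = c(2, 1, 1, 6, 1, 6, 13,
            16, 12, 29, 22, 27, 36, 31, 52, 65, 94,
            97, 138, 138, 126, 875, 157, 1517, 153, 326, 203,
            234, 525, 338, 9)
)

# Assign each year to its decade, e.g. 2016 -> 2010
yearly$decade <- floor(yearly$Year / 10) * 10

# Median number of discoveries per year within each decade
decade_median <- aggregate(Count ~ decade, data = yearly, FUN = median)
decade_median
```

Note that the 2020s median includes the partial year 2023, so it likely understates a typical full year in that decade.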
discovery_ratebt <- nasa_data |>
  group_by(discovery_year, planet_type) |>
  summarise(
    n_planets = n(),
    .groups = "drop"
  )
ggplot(discovery_ratebt, aes(x = discovery_year, y = n_planets, color = planet_type)) +
  geom_line(linewidth = 1) +
  labs(
    title = "Exoplanet Discoveries Over Time by Planet Type",
    x = "Year of Discovery",
    y = "Number of Planets",
    color = "Planet Type"
  ) +
  theme_minimal()
The line plot shows the number of discoveries per year for each planet type. It highlights a spike in 2016, particularly for Neptune-like and Super Earth planets. In the future, I would like to investigate further to understand why discoveries increased so sharply in 2016 compared to other years.
ggplot(nasa_data, aes(x = planet_type, fill = detection_method)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3") +  # "+" needed so labs() is added to the plot
  labs(
    x = "Planet Type",
    y = "Number of Planets",
    fill = "Detection Method"
  )
# Keep only detection methods with at least 100 discoveries
nasa_filtered <- nasa_data |>
  group_by(detection_method) |>
  filter(n() >= 100) |>
  ungroup()
ggplot(nasa_filtered, aes(x = planet_type, fill = detection_method)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() +
  labs(
    x = "Planet Type",
    y = "Number of Planets",
    fill = "Detection Method"
  )
The two bar plots show the distribution of planet types by detection method. The first plot displays all detection methods, while the second plot keeps only the methods with at least 100 discoveries to provide a clearer view. Super Earth and Neptune-like planets are most commonly discovered using the Transit method, whereas Gas Giants are more often discovered using the Radial Velocity method.
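An alternative to dropping the rare detection methods is to pool them into a single "Other" category, which keeps every planet in the plot. The sketch below shows the idea in base R on hypothetical toy data; the vector `detection_method` stands in for `nasa_data$detection_method`, the counts are invented for illustration, and the 100-planet cutoff mirrors the filter used above.

```r
# Hypothetical toy data; counts are invented, not taken from the dataset
detection_method <- rep(
  c("Transit", "Radial Velocity", "Astrometry", "Pulsar Timing"),
  times = c(150, 120, 2, 7)
)

# Number of planets per method
method_n <- table(detection_method)

# Methods that meet the cutoff keep their own label
common <- names(method_n)[method_n >= 100]

# Everything else is pooled into "Other"
method_lumped <- ifelse(detection_method %in% common, detection_method, "Other")
table(method_lumped)
```

The pooled column could then be mapped to `fill` in place of `detection_method`, at the cost of hiding which rare methods make up the "Other" bar.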