Data Dive Week One

Importing necessary libraries

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.5.2

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'tidyr' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'purrr' was built under R version 4.5.2

## Warning: package 'dplyr' was built under R version 4.5.2

## Warning: package 'stringr' was built under R version 4.5.2

## Warning: package 'forcats' was built under R version 4.5.2

## Warning: package 'lubridate' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.4     ✔ tibble    3.3.1
## ✔ purrr     1.2.1     ✔ tidyr     1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the NASA data set

nasa_data <- read_delim("C:/Users/imaya/Downloads/cleaned_5250.csv",delim = ",")

## Rows: 5250 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, planet_type, mass_wrt, radius_wrt, detection_method
## dbl (8): distance, stellar_magnitude, discovery_year, mass_multiplier, radiu...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data

head(nasa_data)

## # A tibble: 6 × 13
##   name     distance stellar_magnitude planet_type discovery_year mass_multiplier
##   <chr>       <dbl>             <dbl> <chr>                <dbl>           <dbl>
## 1 11 Coma…      304              4.72 Gas Giant             2007           19.4 
## 2 11 Ursa…      409              5.01 Gas Giant             2009           14.7 
## 3 14 Andr…      246              5.23 Gas Giant             2008            4.8 
## 4 14 Herc…       58              6.62 Gas Giant             2002            8.14
## 5 16 Cygn…       69              6.22 Gas Giant             1996            1.78
## 6 17 Scor…      408              5.23 Gas Giant             2020            4.32
## # ℹ 7 more variables: mass_wrt <chr>, radius_multiplier <dbl>,
## #   radius_wrt <chr>, orbital_radius <dbl>, orbital_period <dbl>,
## #   eccentricity <dbl>, detection_method <chr>

Summaries

Planet Type

summary_type <- nasa_data |>  count(planet_type, sort = TRUE)
summary_type

## # A tibble: 5 × 2
##   planet_type      n
##   <chr>        <int>
## 1 Neptune-like  1825
## 2 Gas Giant     1630
## 3 Super Earth   1595
## 4 Terrestrial    195
## 5 Unknown          5

unique_types <- unique(nasa_data$planet_type)
unique_types

## [1] "Gas Giant"    "Super Earth"  "Neptune-like" "Terrestrial"  "Unknown"

The planet_type summary shows which planet types are discovered most often. Neptune-like planets (1,825), Gas Giants (1,630), and Super Earths (1,595) together account for 5,050 of the 5,250 rows, while Terrestrial planets (195) are far less common and 5 planets have an Unknown type. The column contains five unique values: “Gas Giant,” “Super Earth,” “Neptune-like,” “Terrestrial,” and “Unknown.”

Distance in light-years stats

# Summary stats for distance
summary_distance <- data.frame(
  statistics = c("Minimum", "Maximum", "Mean", "Median"),
  Value = c(
    min(nasa_data$distance, na.rm = TRUE),
    max(nasa_data$distance, na.rm = TRUE),
    mean(nasa_data$distance, na.rm = TRUE),
    median(nasa_data$distance, na.rm = TRUE)
  )
)

# Quantile stats for distance
quantile_distance <- data.frame(
  quantile = c("0%", "25%", "50%", "75%", "100%"),
  Value = quantile(nasa_data$distance, na.rm = TRUE)
)

# Print results
summary_distance

##   statistics     Value
## 1    Minimum     4.000
## 2    Maximum 27727.000
## 3       Mean  2167.169
## 4     Median  1371.000

quantile_distance

##      quantile Value
## 0%         0%     4
## 25%       25%   389
## 50%       50%  1371
## 75%       75%  2779
## 100%     100% 27727

The summary statistics for exoplanet distance (in light-years) describe the range and distribution of discoveries. The closest exoplanet is 4 light-years away and the farthest is 27,727 light-years away. The mean distance (2,167 light-years) is well above the median (1,371 light-years), which indicates a right-skewed distribution: a small number of very distant planets pull the mean upward. The quartiles show that 25% of exoplanets are within 389 light-years, 50% are within 1,371 light-years, and 75% are within 2,779 light-years. In the future, I might explore in more depth how many exoplanets lie within different distance ranges, such as within 100, 500, 1,000, or 5,000 light-years of Earth.
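A minimal sketch of how the distance ranges mentioned above might be counted with cut(). The distance vector here is only an illustrative stand-in (the five quantile values reported above plus one NA), not the real column, and the band labels and the name band_counts are my own choices.

```r
# Illustrative stand-in for nasa_data$distance: the five quantile values
# reported above plus one missing value. Swap in the real column to use it.
distance <- c(4, 389, 1371, 2779, 27727, NA)

# Right-closed bands: (0, 100], (100, 500], (500, 1000], (1000, 5000], (5000, Inf)
bands <- cut(
  distance,
  breaks = c(0, 100, 500, 1000, 5000, Inf),
  labels = c("0-100 ly", "100-500 ly", "500-1000 ly", "1000-5000 ly", ">5000 ly")
)

# table() keeps empty bands as zero counts; useNA adds a row for missing distances
band_counts <- as.data.frame(table(bands, useNA = "ifany"))
colnames(band_counts) <- c("Band", "Count")
band_counts
```

Keeping the NA row visible makes it clear how many planets could not be placed in any band.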

Discovery year stats

# Summary statistics for discovery years
summary_year <- data.frame(
  Statistic = c("Minimum", "Maximum", "Mean", "Median", "Mode"),
  Value = c(
    min(nasa_data$discovery_year, na.rm = TRUE),
    max(nasa_data$discovery_year, na.rm = TRUE),
    mean(nasa_data$discovery_year, na.rm = TRUE),
    median(nasa_data$discovery_year, na.rm = TRUE),
    as.numeric(names(which.max(table(nasa_data$discovery_year))))
  )
)

# Quantiles for discovery years
quantile_year <- data.frame(
  Quantile = c("0%", "25%", "50%", "75%", "100%"),
  Value = quantile(nasa_data$discovery_year, na.rm = TRUE)
)

# Counts per year
counts_byyear <- as.data.frame(table(nasa_data$discovery_year))
colnames(counts_byyear) <- c("Year", "Count")


# Print results
summary_year

##   Statistic    Value
## 1   Minimum 1992.000
## 2   Maximum 2023.000
## 3      Mean 2015.732
## 4    Median 2016.000
## 5      Mode 2016.000

quantile_year

##      Quantile Value
## 0%         0%  1992
## 25%       25%  2014
## 50%       50%  2016
## 75%       75%  2018
## 100%     100%  2023

counts_byyear

##    Year Count
## 1  1992     2
## 2  1994     1
## 3  1995     1
## 4  1996     6
## 5  1997     1
## 6  1998     6
## 7  1999    13
## 8  2000    16
## 9  2001    12
## 10 2002    29
## 11 2003    22
## 12 2004    27
## 13 2005    36
## 14 2006    31
## 15 2007    52
## 16 2008    65
## 17 2009    94
## 18 2010    97
## 19 2011   138
## 20 2012   138
## 21 2013   126
## 22 2014   875
## 23 2015   157
## 24 2016  1517
## 25 2017   153
## 26 2018   326
## 27 2019   203
## 28 2020   234
## 29 2021   525
## 30 2022   338
## 31 2023     9

The summary statistics for the discovery year show that the exoplanets in this data set were discovered between 1992 and 2023. The mean (2015.7) and median (2016) discovery years are close together, and 2016 is also the most common year for discoveries. The quartiles indicate that 25% of discoveries occurred by 2014, 50% by 2016, and 75% by 2018. The year-by-year counts show two pronounced spikes, in 2014 (875) and 2016 (1,517), and no entries at all for 1993. In the future, I might explore the data in more depth to understand why some years had spikes in discoveries while others had very few.
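One caveat about the mode: which.max(table(x)) returns only the first value when several values tie for the highest count. That does not matter here because 2016 is the clear winner, but a small helper could report every tied value. This is only a sketch, and year_mode is a hypothetical name of my own.

```r
# Hypothetical helper: return every value that ties for the highest count
year_mode <- function(x) {
  counts <- table(x)  # table() drops NA values by default
  as.numeric(names(counts)[counts == max(counts)])
}

# Made-up example with a tie between 2014 and 2016
year_mode(c(2014, 2016, 2016, 2018, 2014))
```

On the real column this would be called as year_mode(nasa_data$discovery_year).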

Questions

Has there been an increase in the number of exoplanets discovered at greater distances?

#Distance by year 
distance_byyear <- nasa_data |> 
  group_by(discovery_year) |> 
  summarise(
    mean_distance = mean(distance, na.rm = TRUE),
    median_distance = median(distance, na.rm = TRUE)
  )
  
# Distance counts by decade (ly = light-years)
counts_bydecade <- nasa_data |>
  mutate(
    decade = floor(discovery_year / 10) * 10,
    distance_group = ifelse(distance > 1000, ">1000 ly", "≤1000 ly")
  ) |>
  group_by(decade, distance_group) |>
  summarise(
    n_planets = n(),
    .groups = "drop"
  ) |>
  pivot_wider(
    names_from = distance_group,
    values_from = n_planets,
    values_fill = 0
  )

  
  
# Print results 
distance_byyear

## # A tibble: 31 × 3
##    discovery_year mean_distance median_distance
##             <dbl>         <dbl>           <dbl>
##  1           1992        1957             1957 
##  2           1994        1957             1957 
##  3           1995          50               50 
##  4           1996          51.3             48 
##  5           1997          57               57 
##  6           1998          91.8             96 
##  7           1999          97.1             95 
##  8           2000         102.             106.
##  9           2001         130.             120 
## 10           2002         171              125 
## # ℹ 21 more rows

counts_bydecade

## # A tibble: 4 × 4
##   decade `>1000 ly` `≤1000 ly`  `NA`
##    <dbl>      <int>      <int> <int>
## 1   1990          3         27     0
## 2   2000         45        338     1
## 3   2010       2558       1170     2
## 4   2020        486        606    14

The data suggest that the share of exoplanets discovered at greater distances has grown, although not steadily. Among planets with a known distance, those beyond 1,000 light-years made up about 10% of discoveries in the 1990s (3 of 30) and about 12% in the 2000s (45 of 383), then jumped to about 69% in the 2010s (2,558 of 3,728). In the 2020s the share fell back to about 45%, with 486 planets beyond 1,000 light-years compared to 606 at or within 1,000 light-years. The 2020s figures should be read with caution: the data cover only 2020 to 2023, and 2023 contributes just 9 planets, which makes it difficult to determine a clear trend. The NA column counts planets with a missing distance (17 in total), which ifelse() leaves unclassified.
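Raw counts are hard to compare across decades of very different sizes, so a share is more informative. The sketch below recomputes the share of planets beyond 1,000 light-years from the counts printed in counts_bydecade above. The column names beyond_1000, within_1000, missing, and share_beyond are my own, and planets with a missing distance are left out of the denominator.

```r
# Counts copied from the counts_bydecade output above
decade_counts <- data.frame(
  decade      = c(1990, 2000, 2010, 2020),
  beyond_1000 = c(3, 45, 2558, 486),
  within_1000 = c(27, 338, 1170, 606),
  missing     = c(0, 1, 2, 14)
)

# Share beyond 1,000 ly among planets with a known distance
known <- decade_counts$beyond_1000 + decade_counts$within_1000
decade_counts$share_beyond <- round(decade_counts$beyond_1000 / known, 3)
decade_counts
```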

Is there a pattern between discovery methods and the distance found?

method_distance_counts <- nasa_data |>
  mutate(distance_group = ifelse(distance > 1000, ">1000 ly", "≤1000 ly")) |>
  group_by(detection_method, distance_group) |>
  summarise(
    n_planets = n(),
    .groups = "drop"
  ) |>
  pivot_wider(
    names_from = distance_group,
    values_from = n_planets,
    values_fill = 0
  )

method_distance_counts

## # A tibble: 11 × 4
##    detection_method              `≤1000 ly`  `NA` `>1000 ly`
##    <chr>                              <int> <int>      <int>
##  1 Astrometry                             2     0          0
##  2 Direct Imaging                        60     2          0
##  3 Disk Kinematics                        1     0          0
##  4 Eclipse Timing Variations              7     1          9
##  5 Gravitational Microlensing             0     2        152
##  6 Orbital Brightness Modulation          0     0          9
##  7 Pulsar Timing                          1     1          5
##  8 Pulsation Timing Variations            0     0          2
##  9 Radial Velocity                      977     0         50
## 10 Transit                             1085    11       2849
## 11 Transit Timing Variations              8     0         16

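There is a clear pattern between discovery method and distance. Planets located within 1,000 light-years are most commonly discovered using the Transit (1,085) or Radial Velocity (977) methods, while planets beyond 1,000 light-years are primarily discovered using the Transit method (2,849) or Gravitational Microlensing (152). Radial Velocity finds almost all of its planets nearby (977 of 1,027), whereas Gravitational Microlensing has no detections within 1,000 light-years.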

Has the rate of discovery differed from 1992 to 2023?

# Discovery rate
discovery_rate <- nasa_data |>
  group_by(discovery_year) |>
  summarise(
    n_planets = n(),
    .groups = "drop"
  )

# Filter discovery_rate for different decades
decade_1990 <- discovery_rate |> filter(discovery_year < 2000)
decade_2000 <- discovery_rate |> filter(discovery_year >= 2000 & discovery_year < 2010)
decade_2010 <- discovery_rate |> filter(discovery_year >= 2010 & discovery_year < 2020)
decade_2020 <- discovery_rate |> filter(discovery_year >= 2020)

#Print Results 
decade_1990

## # A tibble: 7 × 2
##   discovery_year n_planets
##            <dbl>     <int>
## 1           1992         2
## 2           1994         1
## 3           1995         1
## 4           1996         6
## 5           1997         1
## 6           1998         6
## 7           1999        13

decade_2000

## # A tibble: 10 × 2
##    discovery_year n_planets
##             <dbl>     <int>
##  1           2000        16
##  2           2001        12
##  3           2002        29
##  4           2003        22
##  5           2004        27
##  6           2005        36
##  7           2006        31
##  8           2007        52
##  9           2008        65
## 10           2009        94

decade_2010

## # A tibble: 10 × 2
##    discovery_year n_planets
##             <dbl>     <int>
##  1           2010        97
##  2           2011       138
##  3           2012       138
##  4           2013       126
##  5           2014       875
##  6           2015       157
##  7           2016      1517
##  8           2017       153
##  9           2018       326
## 10           2019       203

decade_2020

## # A tibble: 4 × 2
##   discovery_year n_planets
##            <dbl>     <int>
## 1           2020       234
## 2           2021       525
## 3           2022       338
## 4           2023         9

The rate of exoplanet discoveries has increased substantially over time, although not steadily. In the 1990s, most years had fewer than 10 discoveries, while in the 2000s a typical year had around 30. In the 2010s, most years saw between roughly 100 and 330 discoveries, with large spikes in 2014 (875) and 2016 (1,517). The data cover four years of the 2020s (2020 to 2023): the first three each recorded over 200 discoveries, while 2023 has only 9, which likely reflects incomplete data for that year. This leads me to wonder how many additional discoveries have been made in the 2020s, since the data currently include only the first few years of the decade.
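To put a number on "a typical year" in each decade, the median is a reasonable choice because it is not distorted by the 2014 and 2016 spikes. The sketch below uses the yearly counts printed in counts_byyear above. The names yearly and decade_summary are my own.

```r
# Yearly counts copied from the counts_byyear output above (1993 has no entries)
yearly <- data.frame(
  year = c(1992, 1994:2023),
  n_planets = c(
    2, 1, 1, 6, 1, 6, 13,
    16, 12, 29, 22, 27, 36, 31, 52, 65, 94,
    97, 138, 138, 126, 875, 157, 1517, 153, 326, 203,
    234, 525, 338, 9
  )
)
yearly$decade <- floor(yearly$year / 10) * 10

# Median discoveries per year and total discoveries for each decade
decade_summary <- data.frame(
  decade = sort(unique(yearly$decade)),
  median_per_year = as.numeric(tapply(yearly$n_planets, yearly$decade, median)),
  total = as.numeric(tapply(yearly$n_planets, yearly$decade, sum))
)
decade_summary
```

The 2020s median is pulled down by the 9 planets recorded for 2023, so it probably understates a typical full year.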

Visualization

discovery_ratebt <- nasa_data |>
  group_by(discovery_year, planet_type) |>
  summarise(
    n_planets = n(),
    .groups = "drop"
  )

ggplot(discovery_ratebt, aes(x = discovery_year, y = n_planets, color = planet_type)) +
  geom_line(linewidth = 1) +       
  labs(
    title = "Exoplanet Discoveries Over Time by Planet Type",
    x = "Year of Discovery",
    y = "Number of Planets",
    color = "Planet Type"
  ) +
  theme_minimal()

The line plot shows the number of discoveries per year for each planet type. It highlights a spike in 2016, particularly for Neptune-like and Super Earth planets. In the future, I would like to investigate further to understand why discoveries increased so sharply in 2016 compared to other years.

ggplot(nasa_data, aes(x = planet_type, fill = detection_method)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3") +
  labs(
    x = "Planet Type",
    y = "Number of Planets",
    fill = "Detection Method"
  )

# Keep only detection methods with at least 100 discoveries
nasa_filtered <- nasa_data |>
  group_by(detection_method) |>
  filter(n() >= 100) |>
  ungroup()

ggplot(nasa_filtered, aes(x = planet_type, fill = detection_method)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() +
  labs(
    x = "Planet Type",
    y = "Number of Planets",
    fill = "Detection Method"
  )

The two bar plots show the distribution of planet types by discovery method. The first plot displays all detection methods, while the second plot keeps only the methods with at least 100 discoveries (Transit, Radial Velocity, and Gravitational Microlensing) to provide a clearer view. Super Earth and Neptune-like planets are most commonly discovered using the Transit method, whereas Gas Giants are more often discovered using the Radial Velocity method.

Datadiveweekone

2026-01-20