title: “data_discovery” author: “Sneha” date: “2024-10-09” output: html_document

Data Documentation

This dataset contains publically available information about all astronauts who participated in space missions before 15 January 2020 collected from NASA, Roscosmos, and fun-made websites. The provided information includes full astronaut name, sex, date of birth, nationality, military status, a title and year of a selction program, and information about each mission completed by a particular astronaut such as a year, ascend and descend shuttle names, mission and extravehicular activity (EVAs) durations.

Possible Visualizations for Further Investigation

Network Graph of Astronaut Co-Missions

  • This visualization would allow us to see who traveled with whom and identify astronauts who frequently collaborate. Astronauts who traveled together multiple times will have stronger connections, and this could help uncover potential patterns in mission assignments or crew selection.

Timeline of Simultaneous Missions by Nation

  • This plot can help identify periods where missions from different countries overlapped, indicating either collaboration or simultaneous space exploration efforts. It can highlight joint missions or periods where different countries were engaged in competitive missions, revealing geopolitical trends in space travel.

Visualization :

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
astro <- read_delim('/Users/sneha/H510-Statistics/astronaut-data.csv')
## Rows: 1277 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, nationality, military_civilian, selection, occupation, ...
## dbl (13): id, number, nationwide_number, year_of_birth, year_of_selection, m...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
astronaut_data <- astro |>
  distinct(name,year_of_birth, .keep_all = TRUE)
astronaut_data
## # A tibble: 565 × 23
##       id number nationwide_number name           sex   year_of_birth nationality
##    <dbl>  <dbl>             <dbl> <chr>          <chr>         <dbl> <chr>      
##  1     1      1                 1 Gagarin, Yuri  male           1934 U.S.S.R/Ru…
##  2     2      2                 2 Titov, Gherman male           1935 U.S.S.R/Ru…
##  3     3      3                 1 Glenn, John H… male           1921 U.S.       
##  4     5      4                 2 Carpenter, M.… male           1925 U.S.       
##  5     6      5                 2 Nikolayev, An… male           1929 U.S.S.R/Ru…
##  6     8      6                 4 Popovich, Pav… male           1930 U.S.S.R/Ru…
##  7    10      7                 3 Schirra, Walt… male           1923 U.S.       
##  8    13      8                 4 Cooper, L. Go… male           1927 U.S.       
##  9    15      9                 5 Bykovsky, Val… male           1934 U.S.S.R/Ru…
## 10    18     10                 6 Tereshkova, V… fema…          1937 U.S.S.R/Ru…
## # ℹ 555 more rows
## # ℹ 16 more variables: military_civilian <chr>, selection <chr>,
## #   year_of_selection <dbl>, mission_number <dbl>,
## #   total_number_of_missions <dbl>, occupation <chr>, year_of_mission <dbl>,
## #   mission_title <chr>, ascend_shuttle <chr>, in_orbit <chr>,
## #   descend_shuttle <chr>, hours_mission <dbl>, total_hrs_sum <dbl>,
## #   field21 <dbl>, eva_hrs_mission <dbl>, total_eva_hrs <dbl>

Astronauts with same mission start date

astronaut_mission_group <- astronaut_data |>
  group_by(year_of_mission, nationality) |>
  summarise(astronaut_count = n(), .groups = 'drop')
ggplot(astronaut_mission_group, aes(x = year_of_mission, y = astronaut_count, fill = nationality)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Astronauts with the Same Mission Start Date", 
       x = "Mission Start Date", 
       y = "Number of Astronauts",
       fill = "Nation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

above_25 <- astronaut_mission_group |>
  filter(astronaut_count > 25) 
above_25
## # A tibble: 1 × 3
##   year_of_mission nationality astronaut_count
##             <dbl> <chr>                 <int>
## 1            1985 U.S.                     30

This is just a initial visualization, need to further investigate more this which nation has the most number of missions and which astronauts participated in the above mission. Additionally need to find out if the above mission happened is a single mission or multiple missions together. Also, try to build a better visualization.

Number of astronauts by mission

astronaut_same_mission <- astronaut_data |>
  group_by(nationality, mission_title) |>
  summarise(astronaut_count = n(), .groups = 'drop')
astronaut_same_mission
## # A tibble: 354 × 3
##    nationality mission_title     astronaut_count
##    <chr>       <chr>                       <int>
##  1 Afghanistan 3                               1
##  2 Australia   STS-77                          1
##  3 Austria     Soyuz TM13/12                   1
##  4 Belgium     STS-45                          1
##  5 Belgium     Soyuz TMA-1/TM-34               1
##  6 Brazil      Soyuz TMA-8                     1
##  7 Bulgaria    2                               1
##  8 Bulgaria    Soyuz 33                        1
##  9 Canada      57                              1
## 10 Canada      STS-41-G                        1
## # ℹ 344 more rows
astronaut_same_mission <- astronaut_data |>
  group_by(nationality, mission_title) |>
  summarise(astronaut_count = n(), .groups = 'drop') |>
  filter(astronaut_count > 3)

We can visualize the missions who has more than 3 counts

ggplot(astronaut_same_mission, aes(x = mission_title, y = astronaut_count, fill = nationality)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Number of Astronauts by Nationality for Each Mission", 
       x = "Mission Title", 
       y = "Number of Astronauts",
       fill = "Nationality") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10), 
        plot.title = element_text(size = 16, face = "bold"))

From above visualization, we could understand that the US mission with count 30 are separate mission that occured at the same time, and we need to further investigate it.

Plan Moving Forward (To-Do List)

Data Preparation:

  • Clean the data: Ensure all relevant data points, such as astronaut names, mission dates, and nations, are accurate and consistent.

  • Create new variables:

    A variable for each mission that lists all astronauts who participated.

    A variable to track the start and end dates of each mission for different nations to analyze overlaps.

Initial Findings

Based on the astronaut dataset, below are my two hypothesis

Hypothesis 1:

Astronauts from the same nation are more likely to travel together on multiple missions.

travel_together <- astronaut_data |>
  group_by(mission_title) |>
  summarise(nationality_count = n_distinct(nationality), .groups = 'drop')

travel_together
## # A tibble: 295 × 2
##    mission_title nationality_count
##    <chr>                     <int>
##  1 1                             2
##  2 10                            1
##  3 11                            1
##  4 12                            1
##  5 13                            1
##  6 14                            2
##  7 15                            2
##  8 16                            2
##  9 17                            2
## 10 18                            1
## # ℹ 285 more rows
install.packages("plotly", repos = "https://cloud.r-project.org")
## 
## The downloaded binary packages are in
##  /var/folders/63/_xq27w2j2ys29sqcy3mh092c0000gn/T//RtmpX6C6q7/downloaded_packages
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
p <- ggplot(travel_together, aes(x = mission_title, y = nationality_count)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Number of Astronauts by Nationality for Each Mission", 
       #x = "Mission Title", 
       y = "Nationality count",
       fill = "Nationality") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9), 
        plot.title = element_text(size = 10, face = "bold")) 

interactive_plot <- ggplotly(p)
interactive_plot

From above visualization, we now know that astronauts from different nations travel together, we need to find more above which all nations collaborated and what are the frequencies.

Hypothesis 2:

Missions between different nations have increased collaboration post-2000.

travel_together_2000 <- astronaut_data |>
  group_by(mission_title, year_of_mission) |>
  summarise(nationality_count = n_distinct(nationality), .groups = 'drop') |>
  arrange(year_of_mission)
travel_together_2000
## # A tibble: 322 × 3
##    mission_title   year_of_mission nationality_count
##    <chr>                     <dbl>             <int>
##  1 Vostok 1                   1961                 1
##  2 Vostok 2                   1961                 1
##  3 MA-6                       1962                 1
##  4 Mercury-Atlas 7            1962                 1
##  5 Mercury-Atlas 8            1962                 1
##  6 Vostok 3                   1962                 1
##  7 Vostok 4                   1962                 1
##  8 Mercury-Atlas 9            1963                 1
##  9 Vostok 5                   1963                 1
## 10 Vostok 6                   1963                 1
## # ℹ 312 more rows
p <- ggplot(travel_together_2000, aes(x = year_of_mission, y = nationality_count,fill = nationality_count)) +
  geom_bar(stat = "identity", position = "stack") +  
  labs(title = "Yearly contribution analysis", 
       x = "year_of_mission", 
       y = "Nationality count",
       fill = "Nationality") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9), 
        plot.title = element_text(size = 10, face = "bold"))

interactive_plot <- ggplotly(p)
interactive_plot

This is just a plot showing yearly collaboration between nations, we need to dive deep to find out which all nations participated in the collaboration.

We could also see that the graph satisfies our hypothesis, even though the missions peaked during 1980, it gradually decreased till 2020, we could see that the collaborations increased(blue colored boxes)

We need to invesitigate further on this to try to figure out which nations participated, which all astronauts participated together and which missions where involved.