title: “data_discovery” author: “Sneha” date: “2024-10-09” output: html_document
This dataset contains publically available information about all astronauts who participated in space missions before 15 January 2020 collected from NASA, Roscosmos, and fun-made websites. The provided information includes full astronaut name, sex, date of birth, nationality, military status, a title and year of a selction program, and information about each mission completed by a particular astronaut such as a year, ascend and descend shuttle names, mission and extravehicular activity (EVAs) durations.
https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-07-14/readme.md
The main question, goal, or purpose for your project
The primary goal of this project is to analyze astronaut missions to identify collaborative space travel patterns between astronauts and explore mission overlaps between different nations. Specifically, we aim to:
Determine which astronauts frequently traveled together on missions.
Identify groups of astronauts who have traveled together most often.
Analyze if and when missions from different nations occurred simultaneously, examining international collaboration or competition in space exploration.
This analysis can provide insights into:
Team dynamics among astronauts (repeated collaboration on missions).
Patterns of International space exploration cooperation or competition.
Network Graph of Astronaut Co-Missions
Timeline of Simultaneous Missions by Nation
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
astro <- read_delim('/Users/sneha/H510-Statistics/astronaut-data.csv')
## Rows: 1277 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, nationality, military_civilian, selection, occupation, ...
## dbl (13): id, number, nationwide_number, year_of_birth, year_of_selection, m...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
astronaut_data <- astro |>
distinct(name,year_of_birth, .keep_all = TRUE)
astronaut_data
## # A tibble: 565 × 23
## id number nationwide_number name sex year_of_birth nationality
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 1 1 1 Gagarin, Yuri male 1934 U.S.S.R/Ru…
## 2 2 2 2 Titov, Gherman male 1935 U.S.S.R/Ru…
## 3 3 3 1 Glenn, John H… male 1921 U.S.
## 4 5 4 2 Carpenter, M.… male 1925 U.S.
## 5 6 5 2 Nikolayev, An… male 1929 U.S.S.R/Ru…
## 6 8 6 4 Popovich, Pav… male 1930 U.S.S.R/Ru…
## 7 10 7 3 Schirra, Walt… male 1923 U.S.
## 8 13 8 4 Cooper, L. Go… male 1927 U.S.
## 9 15 9 5 Bykovsky, Val… male 1934 U.S.S.R/Ru…
## 10 18 10 6 Tereshkova, V… fema… 1937 U.S.S.R/Ru…
## # ℹ 555 more rows
## # ℹ 16 more variables: military_civilian <chr>, selection <chr>,
## # year_of_selection <dbl>, mission_number <dbl>,
## # total_number_of_missions <dbl>, occupation <chr>, year_of_mission <dbl>,
## # mission_title <chr>, ascend_shuttle <chr>, in_orbit <chr>,
## # descend_shuttle <chr>, hours_mission <dbl>, total_hrs_sum <dbl>,
## # field21 <dbl>, eva_hrs_mission <dbl>, total_eva_hrs <dbl>
astronaut_mission_group <- astronaut_data |>
group_by(year_of_mission, nationality) |>
summarise(astronaut_count = n(), .groups = 'drop')
ggplot(astronaut_mission_group, aes(x = year_of_mission, y = astronaut_count, fill = nationality)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Astronauts with the Same Mission Start Date",
x = "Mission Start Date",
y = "Number of Astronauts",
fill = "Nation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
above_25 <- astronaut_mission_group |>
filter(astronaut_count > 25)
above_25
## # A tibble: 1 × 3
## year_of_mission nationality astronaut_count
## <dbl> <chr> <int>
## 1 1985 U.S. 30
This is just a initial visualization, need to further investigate more this which nation has the most number of missions and which astronauts participated in the above mission. Additionally need to find out if the above mission happened is a single mission or multiple missions together. Also, try to build a better visualization.
astronaut_same_mission <- astronaut_data |>
group_by(nationality, mission_title) |>
summarise(astronaut_count = n(), .groups = 'drop')
astronaut_same_mission
## # A tibble: 354 × 3
## nationality mission_title astronaut_count
## <chr> <chr> <int>
## 1 Afghanistan 3 1
## 2 Australia STS-77 1
## 3 Austria Soyuz TM13/12 1
## 4 Belgium STS-45 1
## 5 Belgium Soyuz TMA-1/TM-34 1
## 6 Brazil Soyuz TMA-8 1
## 7 Bulgaria 2 1
## 8 Bulgaria Soyuz 33 1
## 9 Canada 57 1
## 10 Canada STS-41-G 1
## # ℹ 344 more rows
astronaut_same_mission <- astronaut_data |>
group_by(nationality, mission_title) |>
summarise(astronaut_count = n(), .groups = 'drop') |>
filter(astronaut_count > 3)
We can visualize the missions who has more than 3 counts
ggplot(astronaut_same_mission, aes(x = mission_title, y = astronaut_count, fill = nationality)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Number of Astronauts by Nationality for Each Mission",
x = "Mission Title",
y = "Number of Astronauts",
fill = "Nationality") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
plot.title = element_text(size = 16, face = "bold"))
From above visualization, we could understand that the US mission with count 30 are separate mission that occured at the same time, and we need to further investigate it.
Clean the data: Ensure all relevant data points, such as astronaut names, mission dates, and nations, are accurate and consistent.
Create new variables:
A variable for each mission that lists all astronauts who participated.
A variable to track the start and end dates of each mission for different nations to analyze overlaps.
Based on the astronaut dataset, below are my two hypothesis
Astronauts from the same nation are more likely to travel together on multiple missions.
travel_together <- astronaut_data |>
group_by(mission_title) |>
summarise(nationality_count = n_distinct(nationality), .groups = 'drop')
travel_together
## # A tibble: 295 × 2
## mission_title nationality_count
## <chr> <int>
## 1 1 2
## 2 10 1
## 3 11 1
## 4 12 1
## 5 13 1
## 6 14 2
## 7 15 2
## 8 16 2
## 9 17 2
## 10 18 1
## # ℹ 285 more rows
install.packages("plotly", repos = "https://cloud.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/63/_xq27w2j2ys29sqcy3mh092c0000gn/T//RtmpX6C6q7/downloaded_packages
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p <- ggplot(travel_together, aes(x = mission_title, y = nationality_count)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Number of Astronauts by Nationality for Each Mission",
#x = "Mission Title",
y = "Nationality count",
fill = "Nationality") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
plot.title = element_text(size = 10, face = "bold"))
interactive_plot <- ggplotly(p)
interactive_plot
From above visualization, we now know that astronauts from different nations travel together, we need to find more above which all nations collaborated and what are the frequencies.
Missions between different nations have increased collaboration post-2000.
travel_together_2000 <- astronaut_data |>
group_by(mission_title, year_of_mission) |>
summarise(nationality_count = n_distinct(nationality), .groups = 'drop') |>
arrange(year_of_mission)
travel_together_2000
## # A tibble: 322 × 3
## mission_title year_of_mission nationality_count
## <chr> <dbl> <int>
## 1 Vostok 1 1961 1
## 2 Vostok 2 1961 1
## 3 MA-6 1962 1
## 4 Mercury-Atlas 7 1962 1
## 5 Mercury-Atlas 8 1962 1
## 6 Vostok 3 1962 1
## 7 Vostok 4 1962 1
## 8 Mercury-Atlas 9 1963 1
## 9 Vostok 5 1963 1
## 10 Vostok 6 1963 1
## # ℹ 312 more rows
p <- ggplot(travel_together_2000, aes(x = year_of_mission, y = nationality_count,fill = nationality_count)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Yearly contribution analysis",
x = "year_of_mission",
y = "Nationality count",
fill = "Nationality") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
plot.title = element_text(size = 10, face = "bold"))
interactive_plot <- ggplotly(p)
interactive_plot
This is just a plot showing yearly collaboration between nations, we need to dive deep to find out which all nations participated in the collaboration.
We could also see that the graph satisfies our hypothesis, even though the missions peaked during 1980, it gradually decreased till 2020, we could see that the collaborations increased(blue colored boxes)
We need to invesitigate further on this to try to figure out which nations participated, which all astronauts participated together and which missions where involved.