Load the dataset:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_delim('/Users/sneha/H510-Statistics/astronaut.csv')
## Rows: 1277 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, nationality, military_civilian, selection, occupation, ...
## dbl (13): id, number, nationwide_number, year_of_birth, year_of_selection, m...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
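
The absolute path above only resolves on the author's machine, and the loader prints a column-specification message on every run. A minimal sketch of a quieter, more portable load follows, assuming the file is comma-delimited (as the column specification reports). The temporary file and the name `astro` are hypothetical stand-ins for astronaut.csv and are not part of the original analysis.

```r
library(readr)

# Hypothetical stand-in for astronaut.csv so the sketch runs anywhere
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,sex,total_eva_hrs",
             "A,male,1.5",
             "B,female,0"), csv_path)

# read_csv() assumes a comma delimiter; show_col_types = FALSE quiets the spec message
astro <- read_csv(csv_path, show_col_types = FALSE)
astro
```

In a real project, `csv_path` would be a path relative to the project folder rather than a temporary file.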

Create 3 Group By Data Frames

Group by nationality and summarize total missions and total EVA hours:

nationality_summary <- data |>
  group_by(nationality) |>
  summarize(total_missions = sum(total_number_of_missions),
            total_eva_hours = sum(total_eva_hrs))
print(nationality_summary)
## # A tibble: 40 × 3
##    nationality    total_missions total_eva_hours
##    <chr>                   <dbl>           <dbl>
##  1 Afghanistan                 1            0   
##  2 Australia                  16           25.4 
##  3 Austria                     1            0   
##  4 Belgium                     5            0   
##  5 Brazil                      1            0   
##  6 Bulgaria                    2            0   
##  7 Canada                     38          101.  
##  8 China                      22            0.26
##  9 Cuba                        1            0   
## 10 Czechoslovakia              1            0   
## # ℹ 30 more rows

This result shows how total missions and EVA hours vary by nationality. Several countries have very few missions in total, which suggests that they are not very active in the space industry.
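
One caveat on these totals: if the file holds one row per astronaut per mission (an assumption, not confirmed by the text), then `total_number_of_missions` repeats on every row for the same astronaut and a plain `sum()` counts it more than once. A minimal sketch on made-up data, assuming `name` identifies an astronaut and both summed columns are per-astronaut totals:

```r
library(dplyr)

# Made-up data: astronaut "A" flew twice, so the per-astronaut totals repeat
toy <- tibble(
  name = c("A", "A", "B", "C"),
  nationality = c("X", "X", "X", "Y"),
  total_number_of_missions = c(2, 2, 1, 1),
  total_eva_hrs = c(3, 3, 0, 1.5)
)

# Keep one row per astronaut before summing the per-astronaut totals
per_astronaut <- toy |>
  distinct(name, .keep_all = TRUE) |>
  group_by(nationality) |>
  summarize(total_missions = sum(total_number_of_missions),
            total_eva_hours = sum(total_eva_hrs)) |>
  arrange(desc(total_missions))
per_astronaut
```

In this toy data, nationality X totals 3 missions after de-duplication, whereas summing every row would give 5. Sorting with `arrange(desc(...))` also puts the most active groups first.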

Group by sex and summarize total hours in orbit and total EVA hours:

sex_summary <- data |>
  group_by(sex) |>
  summarize(total_hours_orbit = sum(total_hrs_sum),
            total_eva_hours = sum(total_eva_hrs))
print(sex_summary)
## # A tibble: 2 × 3
##   sex    total_hours_orbit total_eva_hours
##   <chr>              <dbl>           <dbl>
## 1 female           279966.            697.
## 2 male            3510606.          13045.

This result lets us explore the difference between male and female astronauts in hours spent in orbit and in EVA. Male astronauts account for far more hours in orbit and far more EVA hours, which indicates that men are currently more represented in space missions than women.
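
Summed hours mix two things: how many astronauts are in each group and how long each one flew. A rough sketch that separates the two on made-up data, assuming `name` identifies an astronaut and `total_hrs_sum` is a per-astronaut total (the columns `n_astronauts`, `mean_hours_orbit`, and `share` are illustrative additions):

```r
library(dplyr)

# Made-up data: three men and one woman
toy <- tibble(
  name = c("A", "B", "C", "D"),
  sex = c("male", "male", "male", "female"),
  total_hrs_sum = c(100, 300, 200, 250)
)

by_sex <- toy |>
  distinct(name, .keep_all = TRUE) |>          # one row per astronaut
  group_by(sex) |>
  summarize(n_astronauts = n(),
            mean_hours_orbit = mean(total_hrs_sum)) |>
  mutate(share = n_astronauts / sum(n_astronauts))
by_sex
```

A head count and a per-astronaut mean speak to representation more directly than a grand total of hours does.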

Group by military/civilian status and summarize total missions and total hours in orbit:

mil_civil_summary <- data |>
  group_by(military_civilian) |>
  summarize(total_missions = sum(total_number_of_missions),
            total_hours_orbit = sum(total_hrs_sum))
print(mil_civil_summary)
## # A tibble: 2 × 3
##   military_civilian total_missions total_hours_orbit
##   <chr>                      <dbl>             <dbl>
## 1 civilian                    1532          1545595.
## 2 military                    2277          2244977.

This grouping lets us compare the mission and orbit-time contributions of military and civilian astronauts. Military astronauts account for more missions and more hours overall, but the ratio of orbit hours to missions is close in the two groups (roughly 1,000 hours per mission counted). No statistical test was run here, so informally we can say that military/civilian status does not appear to have much bearing on mission duration.
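
Because the two groups differ in size, a ratio is easier to compare than raw totals. A small sketch that uses the values printed in the output above (the `hours_per_mission` column is an illustrative addition):

```r
library(dplyr)

# Values copied from the printed mil_civil_summary output
mil_civil <- tibble(
  military_civilian = c("civilian", "military"),
  total_missions = c(1532, 2277),
  total_hours_orbit = c(1545595, 2244977)
)

ratio <- mil_civil |>
  mutate(hours_per_mission = total_hours_orbit / total_missions)
ratio
```

The ratio comes out near 1,009 for civilians and 986 for military astronauts. This is a descriptive comparison only, not a significance test.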

Assigning lowest probability group a special tag

# Choose the nationality data frame
Low_nationality_grp <- nationality_summary |>
  filter(total_missions == min(total_missions))

# Add a special tag to the lowest-probability group
nationality_summary <- nationality_summary |>
  mutate(tag = if_else(total_missions == min(total_missions), "Low_Probability", "Normal"))
Low_nationality_grp
## # A tibble: 23 × 3
##    nationality    total_missions total_eva_hours
##    <chr>                   <dbl>           <dbl>
##  1 Afghanistan                 1               0
##  2 Austria                     1               0
##  3 Brazil                      1               0
##  4 Cuba                        1               0
##  5 Czechoslovakia              1               0
##  6 Denmark                     1               0
##  7 Hungry                      1               0
##  8 India                       1               0
##  9 Israel                      1               0
## 10 Kazakhstan                  1               0
## # ℹ 13 more rows
nationality_summary
## # A tibble: 40 × 4
##    nationality    total_missions total_eva_hours tag            
##    <chr>                   <dbl>           <dbl> <chr>          
##  1 Afghanistan                 1            0    Low_Probability
##  2 Australia                  16           25.4  Normal         
##  3 Austria                     1            0    Low_Probability
##  4 Belgium                     5            0    Normal         
##  5 Brazil                      1            0    Low_Probability
##  6 Bulgaria                    2            0    Normal         
##  7 Canada                     38          101.   Normal         
##  8 China                      22            0.26 Normal         
##  9 Cuba                        1            0    Low_Probability
## 10 Czechoslovakia              1            0    Low_Probability
## # ℹ 30 more rows
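
The tag above marks the smallest mission count. To express it as a probability, each group's count can be divided by the overall total. A sketch using a few rows copied from the printed summary (the `probability` column is an illustrative addition, and the shares below are relative to this four-row subset only):

```r
library(dplyr)

# A few rows copied from the printed nationality_summary
subset_summary <- tibble(
  nationality = c("Afghanistan", "Australia", "Canada", "China"),
  total_missions = c(1, 16, 38, 22)
)

tagged <- subset_summary |>
  mutate(probability = total_missions / sum(total_missions),
         tag = if_else(probability == min(probability),
                       "Low_Probability", "Normal"))
tagged
```

Since every count is divided by the same total, the group with the fewest missions is also the group with the lowest probability, so the tag matches the one assigned above.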

Hypothesis and probability conclusion

From nationality_summary, we observe that 23 of the 40 nationalities, including Afghanistan, Austria, Brazil, Cuba, and Czechoslovakia, have only one mission each.

Hypothesis: Astronauts from countries with fewer missions likely have smaller space programs, leading to less opportunity for space missions.

This might also be because space missions are expensive, and countries with smaller or developing economies may prioritize more immediate needs such as healthcare, education, or infrastructure development over space programs.

Some countries may not yet have the advanced technology necessary to launch such missions.

Low probability when grouping by sex:

Low_probab_grp <- sex_summary |>
  filter(total_hours_orbit == min(total_hours_orbit))

# Add a special tag to the lowest-probability group
sex_summary <- sex_summary |>
  mutate(tag = if_else(total_hours_orbit == min(total_hours_orbit), "Low_Probability", "Normal"))
Low_probab_grp
## # A tibble: 1 × 3
##   sex    total_hours_orbit total_eva_hours
##   <chr>              <dbl>           <dbl>
## 1 female           279966.            697.
sex_summary
## # A tibble: 2 × 4
##   sex    total_hours_orbit total_eva_hours tag            
##   <chr>              <dbl>           <dbl> <chr>          
## 1 female           279966.            697. Low_Probability
## 2 male            3510606.          13045. Normal
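
The same tagging step is applied to two summaries, so it could be wrapped in a small function. A sketch follows; `tag_lowest` is a hypothetical helper name, not something used in the original analysis, and the input values are copied from the printed output:

```r
library(dplyr)

# Hypothetical helper: tag the row(s) holding the minimum of `col`
tag_lowest <- function(df, col) {
  df |>
    mutate(tag = if_else({{ col }} == min({{ col }}, na.rm = TRUE),
                         "Low_Probability", "Normal"))
}

# Values copied from the printed sex_summary
sex_totals <- tibble(
  sex = c("female", "male"),
  total_hours_orbit = c(279966, 3510606)
)

tagged_sex <- tag_lowest(sex_totals, total_hours_orbit)
tagged_sex
```

The `{{ col }}` syntax lets the column be passed unquoted, so the same helper would work for `total_missions` in the nationality summary.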

Hypothesis: Male astronauts are currently more represented in space missions compared to female astronauts.

This may be due to societal gender stereotypes: women’s roles were often relegated to the domestic sphere, which limited their opportunities in fields such as science, engineering, and aerospace. The number of women in STEM programs has also been historically lower than that of men, which reduces the pool of potential female candidates for space programs.

Building visualizations

Visualization of total missions by nationality:

ggplot(nationality_summary, aes(x = nationality, y = total_missions, fill = nationality)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  # Rotate the country names so that all 40 labels stay readable
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(title = "Total Missions by Nationality",
       x = "Nationality",
       y = "Total Missions")

With 40 nationalities the x-axis labels crowd together, so they are rotated here. The countries are also listed in the legend and marked by color, which makes it easy to identify which country has the most missions.

As we can see, the countries in pink shades have the most missions. These are the U.S., Russia, UAE, and the U.K.

These are mostly developed countries with the technological capabilities necessary to launch and sustain such missions. Space exploration is also expensive, requiring significant investment in research, technology, infrastructure, and skilled personnel, so the data above indicates that these countries commit significant resources to space missions.
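
With many categories, a sorted horizontal bar chart is often easier to read than a color-coded vertical one. A sketch of that alternative, using a few rows copied from the printed summary (the object names `subset_summary` and `p` are illustrative):

```r
library(dplyr)
library(forcats)
library(ggplot2)

# A few rows copied from the printed nationality_summary
subset_summary <- tibble(
  nationality = c("Australia", "Belgium", "Bulgaria", "Canada", "China"),
  total_missions = c(16, 5, 2, 38, 22)
)

p <- subset_summary |>
  # Order the bars by mission count instead of alphabetically
  mutate(nationality = fct_reorder(nationality, total_missions)) |>
  ggplot(aes(x = total_missions, y = nationality)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Total Missions by Nationality",
       x = "Total Missions",
       y = "Nationality")
p
```

Putting the names on the y-axis gives each label its own line, and the ordering shows the ranking without relying on fill colors.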

Visualization of hours in orbit by sex:

ggplot(sex_summary, aes(x = sex, y = total_hours_orbit, fill = sex)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Total Hours in Orbit by Sex",
       x = "Sex",
       y = "Total Hours in Orbit")

The graph above indicates male dominance in the space industry. The profession had almost no female representation in the 1950s and 1960s, and this legacy has created long-standing gender imbalances. One possible reason is the physical standards of the earlier space programs, such as height and weight requirements. Similar standards are prevalent in the military as well, making it harder for women to qualify. Significant progress toward gender equality has since been made in many fields, including space missions. One example is Sunita Williams, an American astronaut who has spent over 322 days in space and has held the record for the most spacewalks by a woman.
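
The claim of progress over time could be checked against the data by looking at the female share of astronauts per decade of selection. A rough sketch on made-up data, assuming `name` identifies an astronaut and using the `year_of_selection` column listed in the column specification (the `decade` and `female_share` columns are illustrative additions):

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  name = c("A", "B", "C", "D", "E", "F"),
  sex = c("male", "male", "female", "male", "female", "female"),
  year_of_selection = c(1962, 1965, 1963, 1998, 1996, 2009)
)

share_by_decade <- toy |>
  distinct(name, .keep_all = TRUE) |>                      # one row per astronaut
  mutate(decade = floor(year_of_selection / 10) * 10) |>
  group_by(decade) |>
  summarize(n_astronauts = n(),
            female_share = mean(sex == "female"))
share_by_decade
```

A rising `female_share` across decades in the real data would support the point about progress. A flat one would suggest that the imbalance persists.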

Create a data frame of all combinations of two categorical variables

Selecting unique combinations of nationality and sex:

combinations <- expand.grid(
  nationality = unique(data$nationality),
  sex = unique(data$sex)
)
combinations
##                 nationality    sex
## 1            U.S.S.R/Russia   male
## 2                      U.S.   male
## 3                  Mongolia   male
## 4                   Romania   male
## 5                    France   male
## 6            Czechoslovakia   male
## 7                    Poland   male
## 8                   Germany   male
## 9                  Bulgaria   male
## 10                   Hungry   male
## 11                  Vietnam   male
## 12                     Cuba   male
## 13                    India   male
## 14                   Canada   male
## 15             Saudi Arabia   male
## 16               Netherland   male
## 17                   Mexico   male
## 18                    Syria   male
## 19              Afghanistan   male
## 20                    Japan   male
## 21                     U.K.   male
## 22                  Austria   male
## 23                  Belgium   male
## 24              Switzerland   male
## 25                    Italy   male
## 26                Australia   male
## 27          U.S.S.R/Ukraine   male
## 28                    Spain   male
## 29                 Slovakia   male
## 30 Republic of South Africa   male
## 31                U.K./U.S.   male
## 32                   Israel   male
## 33                    China   male
## 34                   Brazil   male
## 35                   Sweden   male
## 36                  Malysia   male
## 37                    Korea   male
## 38                  Denmark   male
## 39               Kazakhstan   male
## 40                      UAE   male
## 41           U.S.S.R/Russia female
## 42                     U.S. female
## 43                 Mongolia female
## 44                  Romania female
## 45                   France female
## 46           Czechoslovakia female
## 47                   Poland female
## 48                  Germany female
## 49                 Bulgaria female
## 50                   Hungry female
## 51                  Vietnam female
## 52                     Cuba female
## 53                    India female
## 54                   Canada female
## 55             Saudi Arabia female
## 56               Netherland female
## 57                   Mexico female
## 58                    Syria female
## 59              Afghanistan female
## 60                    Japan female
## 61                     U.K. female
## 62                  Austria female
## 63                  Belgium female
## 64              Switzerland female
## 65                    Italy female
## 66                Australia female
## 67          U.S.S.R/Ukraine female
## 68                    Spain female
## 69                 Slovakia female
## 70 Republic of South Africa female
## 71                U.K./U.S. female
## 72                   Israel female
## 73                    China female
## 74                   Brazil female
## 75                   Sweden female
## 76                  Malysia female
## 77                    Korea female
## 78                  Denmark female
## 79               Kazakhstan female
## 80                      UAE female
existing_combinations <- data |>
  select(nationality, sex) |>
  distinct()

existing_combinations
## # A tibble: 48 × 2
##    nationality    sex   
##    <chr>          <chr> 
##  1 U.S.S.R/Russia male  
##  2 U.S.           male  
##  3 U.S.S.R/Russia female
##  4 Mongolia       male  
##  5 Romania        male  
##  6 France         male  
##  7 Czechoslovakia male  
##  8 Poland         male  
##  9 Germany        male  
## 10 Bulgaria       male  
## # ℹ 38 more rows

Finding missing combinations:

missing_combinations <- anti_join(combinations, existing_combinations, by = c("nationality", "sex"))
missing_combinations
##                 nationality    sex
## 1                     Korea   male
## 2                  Mongolia female
## 3                   Romania female
## 4            Czechoslovakia female
## 5                    Poland female
## 6                   Germany female
## 7                  Bulgaria female
## 8                    Hungry female
## 9                   Vietnam female
## 10                     Cuba female
## 11                    India female
## 12             Saudi Arabia female
## 13               Netherland female
## 14                   Mexico female
## 15                    Syria female
## 16              Afghanistan female
## 17                  Austria female
## 18                  Belgium female
## 19              Switzerland female
## 20                Australia female
## 21          U.S.S.R/Ukraine female
## 22                    Spain female
## 23                 Slovakia female
## 24 Republic of South Africa female
## 25                U.K./U.S. female
## 26                   Israel female
## 27                   Brazil female
## 28                   Sweden female
## 29                  Malysia female
## 30                  Denmark female
## 31               Kazakhstan female
## 32                      UAE female
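
The same missing combinations can be found in one pipeline by counting each pair and then filling in the pairs that never occur. A sketch on made-up data, using `tidyr::complete()` as an alternative to `expand.grid()` plus `anti_join()`:

```r
library(dplyr)
library(tidyr)

# Made-up data for illustration only
toy <- tibble(
  nationality = c("X", "X", "Y", "Korea"),
  sex = c("male", "female", "male", "female")
)

missing <- toy |>
  count(nationality, sex) |>
  # Add every nationality/sex pair that is absent, with a count of zero
  complete(nationality, sex, fill = list(n = 0L)) |>
  filter(n == 0)
missing
```

This approach keeps both columns as character vectors, whereas `expand.grid()` converts them to factors unless `stringsAsFactors = FALSE` is set.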

It is interesting that Korea is the only nationality with no male astronaut in the dataset. This might be due to missing records, or it may simply reflect a very small number of Korean astronauts.

As expected, most countries have not sent female astronauts: 31 of the 40 nationalities have no female astronaut in the dataset. This suggests a gender representation issue. It could point to a lack of data, or it could highlight a real underrepresentation of female astronauts.

The dataset might also not be up to date, since significant progress toward gender equality has been made in many fields, including space missions, in recent years.