Hypothesis 1

Null Hypothesis (H0): There is no difference between mean age of male astronauts and female astronauts at the time of their year of selection

We want to test whether there is a significant difference between the average age of male and female astronauts in their year of selection.

Alternative Hypothesis (H1): There is a difference in the mean age of male and female astronauts at the time of their year of selection

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)
## 
## Attaching package: 'pwrss'
## The following object is masked from 'package:stats':
## 
##     power.t.test
astro <- read_delim('/Users/sneha/H510-Statistics/astronaut-data.csv')
## Rows: 1277 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, nationality, military_civilian, selection, occupation, ...
## dbl (13): id, number, nationwide_number, year_of_birth, year_of_selection, m...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Getting unique astronauts

astro_unique <- astro[!duplicated(astro$name), ]
astro_unique
## # A tibble: 564 × 23
##       id number nationwide_number name           sex   year_of_birth nationality
##    <dbl>  <dbl>             <dbl> <chr>          <chr>         <dbl> <chr>      
##  1     1      1                 1 Gagarin, Yuri  male           1934 U.S.S.R/Ru…
##  2     2      2                 2 Titov, Gherman male           1935 U.S.S.R/Ru…
##  3     3      3                 1 Glenn, John H… male           1921 U.S.       
##  4     5      4                 2 Carpenter, M.… male           1925 U.S.       
##  5     6      5                 2 Nikolayev, An… male           1929 U.S.S.R/Ru…
##  6     8      6                 4 Popovich, Pav… male           1930 U.S.S.R/Ru…
##  7    10      7                 3 Schirra, Walt… male           1923 U.S.       
##  8    13      8                 4 Cooper, L. Go… male           1927 U.S.       
##  9    15      9                 5 Bykovsky, Val… male           1934 U.S.S.R/Ru…
## 10    18     10                 6 Tereshkova, V… fema…          1937 U.S.S.R/Ru…
## # ℹ 554 more rows
## # ℹ 16 more variables: military_civilian <chr>, selection <chr>,
## #   year_of_selection <dbl>, mission_number <dbl>,
## #   total_number_of_missions <dbl>, occupation <chr>, year_of_mission <dbl>,
## #   mission_title <chr>, ascend_shuttle <chr>, in_orbit <chr>,
## #   descend_shuttle <chr>, hours_mission <dbl>, total_hrs_sum <dbl>,
## #   field21 <dbl>, eva_hrs_mission <dbl>, total_eva_hrs <dbl>
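
The deduplication step keeps the first row encountered for each name. A minimal sketch on made-up data (the toy data frame and its values are hypothetical, not taken from the astronaut file) showing what the indexing with !duplicated() does:

```r
# Toy data (hypothetical): one row per astronaut-mission pair
toy <- data.frame(
  name = c("A", "A", "B", "C", "C", "C"),
  year_of_selection = c(1990, 1990, 1995, 2000, 2000, 2000),
  mission_number = c(1, 2, 1, 1, 2, 3)
)

# Keep only the first row seen for each name
toy_unique <- toy[!duplicated(toy$name), ]

# One row per astronaut remains, and it is the first mission row
nrow(toy_unique)
toy_unique$mission_number
```

Since year_of_selection and year_of_birth are constant within an astronaut, it does not matter which of the duplicate rows is kept when computing age. With dplyr, distinct(name, .keep_all = TRUE) should give the same result.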

Creating a new age column

astro_unique$age <- astro_unique$year_of_selection - astro_unique$year_of_birth
astro_unique
## # A tibble: 564 × 24
##       id number nationwide_number name           sex   year_of_birth nationality
##    <dbl>  <dbl>             <dbl> <chr>          <chr>         <dbl> <chr>      
##  1     1      1                 1 Gagarin, Yuri  male           1934 U.S.S.R/Ru…
##  2     2      2                 2 Titov, Gherman male           1935 U.S.S.R/Ru…
##  3     3      3                 1 Glenn, John H… male           1921 U.S.       
##  4     5      4                 2 Carpenter, M.… male           1925 U.S.       
##  5     6      5                 2 Nikolayev, An… male           1929 U.S.S.R/Ru…
##  6     8      6                 4 Popovich, Pav… male           1930 U.S.S.R/Ru…
##  7    10      7                 3 Schirra, Walt… male           1923 U.S.       
##  8    13      8                 4 Cooper, L. Go… male           1927 U.S.       
##  9    15      9                 5 Bykovsky, Val… male           1934 U.S.S.R/Ru…
## 10    18     10                 6 Tereshkova, V… fema…          1937 U.S.S.R/Ru…
## # ℹ 554 more rows
## # ℹ 17 more variables: military_civilian <chr>, selection <chr>,
## #   year_of_selection <dbl>, mission_number <dbl>,
## #   total_number_of_missions <dbl>, occupation <chr>, year_of_mission <dbl>,
## #   mission_title <chr>, ascend_shuttle <chr>, in_orbit <chr>,
## #   descend_shuttle <chr>, hours_mission <dbl>, total_hrs_sum <dbl>,
## #   field21 <dbl>, eva_hrs_mission <dbl>, total_eva_hrs <dbl>, age <dbl>

Visualization of the different ages in the dataset

plot(astro_unique$age, type = "h", col = "blue", xlab = "Index", ylab = "Age", main = "Age at Year of Selection")

Before planning the test, we compare the standard deviation and mean of age for each sex:

astro_unique |>
  group_by(sex) |>
  summarize(sd = sd(age),
            mean = mean(age))
## # A tibble: 2 × 3
##   sex       sd  mean
##   <chr>  <dbl> <dbl>
## 1 female  3.74  32.5
## 2 male    5.24  34.6

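These standard deviations (3.74 and 5.24) are of the same order, so for the power analysis we approximate the common standard deviation with that of the whole dataset.

Kappa here is the ratio between the two sample sizes (n1 / n2); setting kappa = 1 assumes the two groups are equal in size. Setting mu1 = 2, with mu2 left at its default of 0, asks for the sample size needed to detect a 2-year difference in mean age.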

test <- pwrss.t.2means(mu1 = 2, 
                       sd1 = sd(pluck(astro_unique, "age")),
                       kappa = 1,
                       power = .80, alpha = 0.05, 
                       alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 105 
##   n2 = 105 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 208 
##  Non-centrality parameter = 2.822 
##  Type I error rate = 0.05 
##  Type II error rate = 0.2
plot(test)
## Warning in qt(1 - prob.extreme, df = df, ncp = ncp, lower.tail = TRUE): full
## precision may not have been achieved in 'pnt{final}'

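Non-centrality parameter = 2.822: This is the shift of the t distribution under the alternative hypothesis. It combines the standardized effect size (a 2-year difference divided by the standard deviation) with the sample sizes.

Sample size - n1 = 105, n2 = 105: At least 105 astronauts are needed in each group to detect a 2-year difference in mean age with 80% power at alpha = 0.05. These are required sizes, not the sizes of the groups in the data.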
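
The required sample size can be cross-checked with base R. The sketch below is an illustration under stated assumptions, not the calculation used above: sd_age is a hypothetical placeholder for sd(astro_unique$age), whose value is not printed in this report, and the stats:: prefix is used because pwrss masks power.t.test.

```r
# Hypothetical placeholder; replace with sd(astro_unique$age) on the real data
sd_age <- 5.1
# Smallest difference in mean age (years) we want to be able to detect
min_diff <- 2

# Two-sample, two-sided power calculation from the stats package
check <- stats::power.t.test(delta = min_diff, sd = sd_age,
                             sig.level = 0.05, power = 0.80,
                             type = "two.sample",
                             alternative = "two.sided")

# Required sample size per group, rounded up to a whole astronaut
n_per_group <- ceiling(check$n)
n_per_group

# Non-centrality parameter for two equal groups: delta / (sd * sqrt(2 / n))
ncp <- min_diff / (sd_age * sqrt(2 / n_per_group))
ncp
```

With a standard deviation close to 5, the result should land near the 105 per group reported by pwrss.t.2means.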

Now, to test our hypothesis, we can use a t-test.

Two-sided test for finding the difference of means:

astro_unique <- astro_unique |>
  filter(sex %in% c("male", "female")) |>
  filter(!is.na(age) & !is.na(sex))
my_test <- t.test(age ~ sex, data = astro_unique,
                  alternative = "two.sided")

my_test
## 
##  Welch Two Sample t-test
## 
## data:  age by sex
## t = -3.9678, df = 97.886, p-value = 0.0001383
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -3.114323 -1.037677
## sample estimates:
## mean in group female   mean in group male 
##               32.500               34.576

Note that p-value = 0.0001383, which is far below the 0.05 significance level. We therefore reject the null hypothesis: the mean ages of male and female astronauts at selection differ.

The 95% confidence interval for the difference in means (female minus male), about -3.11 to -1.04 years, does not include 0. Female astronauts were, on average, roughly 1 to 3 years younger than male astronauts at selection. The difference is modest, but the interval is a further reason to reject our null hypothesis that there is no difference between the ages of female and male astronauts.
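
To make the Welch test output above less of a black box, the sketch below recomputes the t statistic, degrees of freedom, p-value and confidence interval by hand and compares them with t.test(). The two age vectors are hypothetical toy values, not the astronaut data.

```r
# Toy ages (hypothetical values, not taken from the astronaut data)
female <- c(29, 31, 33, 30, 35, 32, 34)
male   <- c(33, 38, 30, 41, 36, 29, 37, 35, 40)

# Squared standard errors of each group mean (variances are not pooled)
v_f <- var(female) / length(female)
v_m <- var(male) / length(male)
se  <- sqrt(v_f + v_m)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(female) - mean(male)) / se
df <- (v_f + v_m)^2 /
  (v_f^2 / (length(female) - 1) + v_m^2 / (length(male) - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_val <- 2 * pt(-abs(t_stat), df)
ci <- (mean(female) - mean(male)) + c(-1, 1) * qt(0.975, df) * se

# Built-in Welch test for comparison (var.equal = FALSE is the default)
ref <- t.test(female, male)
all.equal(unname(ref$statistic), t_stat)
all.equal(as.numeric(ref$conf.int), ci)
```

The non-integer degrees of freedom in the real output (97.886) come from the same Welch-Satterthwaite formula.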

Let's visualize the mean ages of female and male astronauts.

mean_age <- astro_unique |>
  group_by(sex) |>
  summarize(mean_age = mean(age))
ggplot(mean_age, aes(x = sex, y = mean_age, fill = sex)) +
  geom_col(width = 0.5) +
  labs(title = "Mean Age by Sex", x = "Gender", y = "Mean Age") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("steelblue", "pink"))

The chart shows that the mean ages of female and male astronauts differ only slightly. A bar chart of means cannot establish significance by itself, but together with the t-test result it supports rejecting the null hypothesis in favor of our alternative hypothesis that there is a difference in the mean age of male and female astronauts at the time of their year of selection.
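
The t-test says whether the difference is statistically detectable, not how large it is in standardized terms. The sketch below estimates Cohen's d from the summary statistics printed earlier. Averaging the two variances with equal weight is a simplifying assumption made here because the group sizes are not shown in this report, so the value is only approximate.

```r
# Summary statistics copied from the output above
mean_female <- 32.500
mean_male   <- 34.576
sd_female   <- 3.74
sd_male     <- 5.24

# Assumption: equal-weight average of the variances (group sizes unknown here)
sd_avg <- sqrt((sd_female^2 + sd_male^2) / 2)

# Standardized mean difference (female minus male)
d_approx <- (mean_female - mean_male) / sd_avg
d_approx
```

A magnitude a little under 0.5 would conventionally be read as a small-to-medium effect. On the full data, the cohen.d() function from the effsize package loaded at the top could be used instead; it is not run here.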

Hypothesis 2

Null Hypothesis (H0): There is no association between the gender of the astronaut and whether they have completed a mission.
Alternative Hypothesis (H1): There is an association between the gender of the astronaut and mission completion.

data <- astro |>
  mutate(mission_completed = ifelse(hours_mission > 0, "Yes", "No"))

contingency_table <- table(data$sex, data$mission_completed)

fisher_test_result <- fisher.test(contingency_table)


fisher_test_result
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.5103
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   0.03338758 14.34889202
## sample estimates:
## odds ratio 
##   1.589441
contingency_table
##         
##            No  Yes
##   female    1  142
##   male      5 1129

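The contingency table summarizes the counts by gender and mission completion status. Because astro contains one row per astronaut per mission (1277 rows for 564 unique astronauts), these counts are mission records rather than individual astronauts, and "completed" here means a record with hours_mission > 0.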
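
The test can be reproduced directly from the four counts in the table, which also makes the odds ratio easier to inspect. This is a sketch built only from the numbers printed above; the object names are illustrative.

```r
# Counts copied from the contingency table above (rows: sex, columns: status)
tab <- matrix(c(1, 142,
                5, 1129),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("female", "male"),
                              mission_completed = c("No", "Yes")))

# Sample (cross-product) odds ratio: (1 * 1129) / (142 * 5)
sample_or <- (tab["female", "No"] * tab["male", "Yes"]) /
  (tab["female", "Yes"] * tab["male", "No"])
sample_or

# Fisher's exact test; its reported odds ratio is a conditional estimate,
# so it can differ slightly from the cross-product value
res <- fisher.test(tab)
res$p.value
res$estimate

# Expected counts under independence; very small values favor an exact test
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected
```

The expected count for the female "No" cell is well below 5, which is the usual reason to prefer Fisher's exact test over a chi-squared test for a table like this.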

library(ggplot2)

ggplot(data, aes(x = sex, fill = mission_completed)) +
  geom_bar(position = "dodge") +
  labs(title = "Mission Completion by Gender", x = "Gender", y = "Count of Mission Records") +
  theme_minimal()

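This graph is consistent with the test result. In both groups nearly every record has mission hours greater than zero (142 of 143 female records and 1129 of 1134 male records), so the completion rate is almost identical for female and male astronauts.

So we fail to reject our null hypothesis here. This does not prove that there is no association; it means the data give no evidence of one.

Based on Fisher's exact test (p-value = 0.5103; the 95% confidence interval for the odds ratio, about 0.03 to 14.35, includes 1), there is no statistically significant association between gender and the mission completion variable.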
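
The test above is run on astro, which has one row per astronaut per mission, so astronauts with several missions are counted several times. One possible alternative, sketched here on hypothetical toy data, is to collapse to one row per astronaut before building the table. The column any_completed is a name introduced only for this illustration.

```r
# Toy data (hypothetical): one row per astronaut-mission pair
toy <- data.frame(
  name = c("A", "A", "B", "C", "C", "D"),
  sex = c("female", "female", "male", "male", "male", "female"),
  hours_mission = c(10, 0, 5, 0, 0, 7)
)

# One row per astronaut: TRUE if any of their missions logged hours
per_astronaut <- aggregate(hours_mission ~ name + sex, data = toy,
                           FUN = function(h) any(h > 0))
names(per_astronaut)[3] <- "any_completed"

# Contingency table at the astronaut level
table(per_astronaut$sex, per_astronaut$any_completed)
```

Whether the mission record or the astronaut is the right unit depends on the question being asked; the sketch only shows how the second option could be set up.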