We want to test whether the mean age of male and female astronauts at their year of selection differs significantly.
Null Hypothesis (H0): There is no difference between the mean age of male astronauts and female astronauts at their year of selection.
Alternative Hypothesis (H1): There is a difference in the mean age of male and female astronauts at their year of selection.
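Written in symbols, the two hypotheses can be stated as follows. The subscripted symbols are notation introduced here for illustration, with each mean referring to age at the year of selection:

```latex
H_0:\ \mu_{\text{male}} = \mu_{\text{female}}
\qquad \text{versus} \qquad
H_1:\ \mu_{\text{male}} \neq \mu_{\text{female}}
```

Because H1 only states that the means differ, without naming a direction, a two-sided test is the natural fit.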
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)
##
## Attaching package: 'pwrss'
## The following object is masked from 'package:stats':
##
## power.t.test
astro <- read_delim('/Users/sneha/H510-Statistics/astronaut-data.csv')
## Rows: 1277 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, nationality, military_civilian, selection, occupation, ...
## dbl (13): id, number, nationwide_number, year_of_birth, year_of_selection, m...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Getting unique astronauts (keeping the first row for each name)
astro_unique <- astro[!duplicated(astro$name), ]
astro_unique
## # A tibble: 564 × 23
## id number nationwide_number name sex year_of_birth nationality
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 1 1 1 Gagarin, Yuri male 1934 U.S.S.R/Ru…
## 2 2 2 2 Titov, Gherman male 1935 U.S.S.R/Ru…
## 3 3 3 1 Glenn, John H… male 1921 U.S.
## 4 5 4 2 Carpenter, M.… male 1925 U.S.
## 5 6 5 2 Nikolayev, An… male 1929 U.S.S.R/Ru…
## 6 8 6 4 Popovich, Pav… male 1930 U.S.S.R/Ru…
## 7 10 7 3 Schirra, Walt… male 1923 U.S.
## 8 13 8 4 Cooper, L. Go… male 1927 U.S.
## 9 15 9 5 Bykovsky, Val… male 1934 U.S.S.R/Ru…
## 10 18 10 6 Tereshkova, V… fema… 1937 U.S.S.R/Ru…
## # ℹ 554 more rows
## # ℹ 16 more variables: military_civilian <chr>, selection <chr>,
## # year_of_selection <dbl>, mission_number <dbl>,
## # total_number_of_missions <dbl>, occupation <chr>, year_of_mission <dbl>,
## # mission_title <chr>, ascend_shuttle <chr>, in_orbit <chr>,
## # descend_shuttle <chr>, hours_mission <dbl>, total_hrs_sum <dbl>,
## # field21 <dbl>, eva_hrs_mission <dbl>, total_eva_hrs <dbl>
Creating a new age column
astro_unique$age <- astro_unique$year_of_selection - astro_unique$year_of_birth
astro_unique
## # A tibble: 564 × 24
## id number nationwide_number name sex year_of_birth nationality
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 1 1 1 Gagarin, Yuri male 1934 U.S.S.R/Ru…
## 2 2 2 2 Titov, Gherman male 1935 U.S.S.R/Ru…
## 3 3 3 1 Glenn, John H… male 1921 U.S.
## 4 5 4 2 Carpenter, M.… male 1925 U.S.
## 5 6 5 2 Nikolayev, An… male 1929 U.S.S.R/Ru…
## 6 8 6 4 Popovich, Pav… male 1930 U.S.S.R/Ru…
## 7 10 7 3 Schirra, Walt… male 1923 U.S.
## 8 13 8 4 Cooper, L. Go… male 1927 U.S.
## 9 15 9 5 Bykovsky, Val… male 1934 U.S.S.R/Ru…
## 10 18 10 6 Tereshkova, V… fema… 1937 U.S.S.R/Ru…
## # ℹ 554 more rows
## # ℹ 17 more variables: military_civilian <chr>, selection <chr>,
## # year_of_selection <dbl>, mission_number <dbl>,
## # total_number_of_missions <dbl>, occupation <chr>, year_of_mission <dbl>,
## # mission_title <chr>, ascend_shuttle <chr>, in_orbit <chr>,
## # descend_shuttle <chr>, hours_mission <dbl>, total_hrs_sum <dbl>,
## # field21 <dbl>, eva_hrs_mission <dbl>, total_eva_hrs <dbl>, age <dbl>
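The two steps above (deduplicating by name, then deriving age) could also be written as a single dplyr pipeline. This is only a sketch on a small made-up data frame: the `toy` object and its values are hypothetical, and only the column names are taken from the dataset.

```r
library(dplyr)

# Hypothetical toy data: astronaut "A" appears twice (two missions)
toy <- data.frame(
  name = c("A", "A", "B"),
  year_of_birth = c(1934, 1934, 1937),
  year_of_selection = c(1960, 1960, 1962)
)

toy_unique <- toy |>
  distinct(name, .keep_all = TRUE) |>               # keep the first row per name
  mutate(age = year_of_selection - year_of_birth)   # age at year of selection

toy_unique
```

Like `!duplicated(astro$name)`, `distinct(name, .keep_all = TRUE)` keeps the first row it meets for each name, so the two approaches should select the same rows.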
Visualization of the different ages in the dataset
plot(astro_unique$age, type = "h", col = "blue", xlab = "Index", ylab = "Age", main = "Age at Year of Selection")
Before choosing the parameters for the power analysis, I compare the standard deviation and mean of age for each sex:
astro_unique |>
  group_by(sex) |>
  summarize(sd = sd(age),
            mean = mean(age))
## # A tibble: 2 × 3
## sex sd mean
## <chr> <dbl> <dbl>
## 1 female 3.74 32.5
## 2 male 5.24 34.6
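These standard deviations (3.74 for female, 5.24 for male) are not identical but are of similar magnitude, so we use the standard deviation of age over the whole dataset as a single approximation.
Kappa here is the ratio between the two sample sizes (n1/n2); we assume the groups are equal in size, so kappa = 1. The value mu1 = 2 is the difference in mean age, in years, that we want to be able to detect.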
test <- pwrss.t.2means(mu1 = 2,
                       sd1 = sd(pluck(astro_unique, "age")),
                       kappa = 1,
                       power = .80, alpha = 0.05,
                       alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.8
## n1 = 105
## n2 = 105
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 208
## Non-centrality parameter = 2.822
## Type I error rate = 0.05
## Type II error rate = 0.2
plot(test)
## Warning in qt(1 - prob.extreme, df = df, ncp = ncp, lower.tail = TRUE): full
## precision may not have been achieved in 'pnt{final}'
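Non-centrality parameter = 2.822: This is the standardized difference between the means scaled by the sample size, so it reflects both the effect size and the number of observations.
Sample size - n1 = 105, n2 = 105: Each group needs at least 105 astronauts to reach 80% power at alpha = 0.05.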
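As a rough cross-check of the power analysis, the reported numbers can be tied together in base R. The sketch below assumes the usual relation for equal group sizes, ncp = d * sqrt(n / 2), and backs out the standardized effect and the overall standard deviation it implies. `sd_implied` is an inference from the printed output, not a value the notebook reports.

```r
# Values copied from the pwrss output above
ncp   <- 2.822   # reported non-centrality parameter
n     <- 105     # reported per-group sample size
delta <- 2       # difference in years passed as mu1

# For equal group sizes, ncp = d * sqrt(n / 2), so:
d_implied  <- ncp / sqrt(n / 2)   # implied standardized effect size
sd_implied <- delta / d_implied   # implied overall SD of age (inferred)

# Cross-check with base R; the stats:: prefix is needed
# because pwrss masks power.t.test
check <- stats::power.t.test(delta = delta, sd = sd_implied,
                             sig.level = 0.05, power = 0.80,
                             type = "two.sample",
                             alternative = "two.sided")
n_required <- ceiling(check$n)

c(d_implied = d_implied, sd_implied = sd_implied, n_required = n_required)
```

If the two packages agree, `n_required` should land close to the 105 per group reported above, and `d_implied` of roughly 0.39 would count as a small-to-medium standardized effect.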
Two-sided test for finding the difference of means:
astro_unique <- astro_unique |>
  filter(sex %in% c("male", "female")) |>
  filter(!is.na(age) & !is.na(sex))
my_test <- t.test(age ~ sex, data = astro_unique,
                  alternative = "two.sided")
my_test
##
## Welch Two Sample t-test
##
## data: age by sex
## t = -3.9678, df = 97.886, p-value = 0.0001383
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -3.114323 -1.037677
## sample estimates:
## mean in group female mean in group male
## 32.500 34.576
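The output is labelled "Welch" because `t.test` does not assume equal variances unless `var.equal = TRUE` is set. For reference, the standard Welch statistic and its approximate degrees of freedom are given below, where the sample means, sample standard deviations s and group sizes n are generic notation rather than values printed in this notebook:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx
\frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}
     {\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}}
```

This approximation is why the degrees of freedom (97.886) are not a whole number.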
Note that
p-value = 0.0001383
which is far below the 0.05 significance level, so the result is statistically significant: we reject the null hypothesis and conclude that mean(male) != mean(female).
The 95% confidence interval for the difference in means (female minus male) runs from about -3.11 to -1.04 years and does not include 0. The difference is small, roughly 2 years, but an interval that excludes 0 is a further reason to reject our null hypothesis that there is no difference between the ages of female and male astronauts.
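The confidence interval can be reconstructed from the numbers printed by `t.test`, which helps show what it is an interval for: the difference in means (female minus male). The sketch below uses only values copied from the output. The standard error is backed out from t = difference / SE, so it is accurate only up to rounding of the printed values.

```r
# Values copied from the Welch t-test output above
mean_female <- 32.500
mean_male   <- 34.576
t_stat      <- -3.9678
df          <- 97.886

diff_means <- mean_female - mean_male   # female minus male, as in the output
std_error  <- diff_means / t_stat       # implied by t = diff / SE
margin     <- qt(0.975, df) * std_error # half-width of the 95% interval

ci <- c(lower = diff_means - margin, upper = diff_means + margin)
ci
```

The result should agree with the printed interval to about two decimal places, with both limits negative, meaning the female mean is lower than the male mean.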
Let’s visualize the mean ages of female and male astronauts
mean_age <- astro_unique |>
  group_by(sex) |>
  summarize(mean_age = mean(age))
ggplot(mean_age, aes(x = sex, y = mean_age, fill = sex)) +
  geom_bar(stat = "identity", width = 0.5) +
  labs(title = "Mean Age by Sex", x = "Gender", y = "Mean Age") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("steelblue", "pink"))
The chart shows that the mean ages of female and male astronauts differ only slightly, by about 2 years. Together with the t-test result, this leads us to reject the null hypothesis in favor of the alternative hypothesis that there is a difference in the mean age of male and female astronauts at their year of selection.
Null Hypothesis (H0): There is no association between the gender of the astronaut and whether they have completed a mission.
Alternative Hypothesis (H1): There is an association between the gender of the astronaut and mission completion.
data <- astro |>
  mutate(mission_completed = ifelse(hours_mission > 0, "Yes", "No"))
contingency_table <- table(data$sex, data$mission_completed)
fisher_test_result <- fisher.test(contingency_table)
fisher_test_result
##
## Fisher's Exact Test for Count Data
##
## data: contingency_table
## p-value = 0.5103
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.03338758 14.34889202
## sample estimates:
## odds ratio
## 1.589441
contingency_table
##
## No Yes
## female 1 142
## male 5 1129
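The contingency table summarizes the counts by gender and mission completion status. Because it is built from astro rather than astro_unique, each count is a row of the dataset (1277 rows in total), not a unique astronaut.
Female: 1 record has no completed mission (hours_mission is not greater than 0), while 142 records have a completed mission.
Male: 5 records have no completed mission, while 1129 records have a completed mission.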
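Since `data` is derived from `astro` (1277 rows) rather than from the 564 unique names, the table counts rows, and an astronaut with several missions contributes several times. If a per-astronaut table were wanted instead, one possible sketch is to collapse to one row per astronaut first. The `toy` data frame and the rule "completed at least one mission" are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data: one row per astronaut-mission
toy <- data.frame(
  name = c("A", "A", "B", "C", "C"),
  sex = c("female", "female", "male", "male", "male"),
  hours_mission = c(10, 0, 5, 0, 0)
)

# One row per astronaut: "Yes" if any mission has hours_mission > 0
per_astronaut <- toy |>
  group_by(name, sex) |>
  summarize(
    mission_completed = ifelse(any(hours_mission > 0, na.rm = TRUE), "Yes", "No"),
    .groups = "drop"
  )

tab <- table(per_astronaut$sex, per_astronaut$mission_completed)
tab
```

A table built this way satisfies the independence assumption of Fisher's test more plausibly, because each astronaut is counted once.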
p-value = 0.5103
The p-value is the probability of observing data at least as extreme as ours if the null hypothesis is true. A p-value of 0.5103 suggests that there is not enough evidence to reject the null hypothesis.
Since 0.5103 is much greater than the 0.05 significance level, we fail to reject the null hypothesis. This means we do not have sufficient evidence to conclude that there is an association between the gender of the astronaut and mission completion status.
Confidence interval - This interval suggests that the true odds ratio could be as low as 0.033 or as high as 14.35. Since it includes 1 (which represents no difference), it is consistent with there being no significant difference in the odds of mission completion between females and males. The interval is very wide because only 6 records fall in the “No” column.
Based on the results of Fisher’s test, we fail to reject the null hypothesis. This means there is no statistically significant association between the gender of astronauts and their mission completion status. The data do not provide sufficient evidence to conclude that gender influences whether astronauts complete missions. Given the layout of the table, the odds ratio of 1.59 compares the odds of “No” for female records with the odds of “No” for male records, so the point estimate leans toward slightly higher odds of non-completion among female records, not higher odds of completion. This finding is not statistically significant.
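To read the direction of the odds ratio, it helps to rebuild the table from the printed counts and compute the sample odds ratio by hand. In the sketch below the counts are copied from the contingency table above, and the object names are illustrative.

```r
# Counts copied from the printed contingency table
tab <- matrix(c(1, 142,
                5, 1129),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("female", "male"),
                              mission_completed = c("No", "Yes")))

# Sample odds ratio: odds of "No" for female rows over odds of "No" for male rows
sample_or <- (tab["female", "No"] * tab["male", "Yes"]) /
  (tab["female", "Yes"] * tab["male", "No"])

# Share of "Yes" rows within each sex
completion_rate <- prop.table(tab, margin = 1)[, "Yes"]

res <- fisher.test(tab)

sample_or
completion_rate
res$p.value
```

The hand-computed ratio (about 1.59) is close to, but not identical to, the value printed by `fisher.test`, which reports a conditional maximum likelihood estimate. A ratio above 1 here means higher odds of "No" in the female row, which matches the marginally lower share of "Yes" rows for females.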
library(ggplot2)
ggplot(data, aes(x = sex, fill = mission_completed)) +
  geom_bar(position = "dodge") +
  labs(title = "Mission Completion by Gender", x = "Gender", y = "Count of Astronauts") +
  theme_minimal()
This graph is consistent with the test result: in both groups almost every record is “Yes”, and the “No” bars are barely visible for either female or male astronauts.
So we fail to reject our null hypothesis here: based on Fisher’s test, there is no statistically significant association between gender and the mission completion variable.