week-7

Hypothesis 1

Null Hypothesis (H0): There is no difference between the mean ages of male and female astronauts in their year of selection.

We want to test whether there is a significant difference in the average age of male and female astronauts in their year of selection.

Alternative Hypothesis (H1): There is a difference between the mean ages of male and female astronauts in their year of selection.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)

## 
## Attaching package: 'pwrss'

## The following object is masked from 'package:stats':
## 
##     power.t.test

astro <- read_delim('/Users/sneha/H510-Statistics/astronaut-data.csv')

## Rows: 1277 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, nationality, military_civilian, selection, occupation, ...
## dbl (13): id, number, nationwide_number, year_of_birth, year_of_selection, m...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Getting unique astronauts

astro_unique <- astro[!duplicated(astro$name), ]
astro_unique

## # A tibble: 564 × 23
##       id number nationwide_number name           sex   year_of_birth nationality
##    <dbl>  <dbl>             <dbl> <chr>          <chr>         <dbl> <chr>      
##  1     1      1                 1 Gagarin, Yuri  male           1934 U.S.S.R/Ru…
##  2     2      2                 2 Titov, Gherman male           1935 U.S.S.R/Ru…
##  3     3      3                 1 Glenn, John H… male           1921 U.S.       
##  4     5      4                 2 Carpenter, M.… male           1925 U.S.       
##  5     6      5                 2 Nikolayev, An… male           1929 U.S.S.R/Ru…
##  6     8      6                 4 Popovich, Pav… male           1930 U.S.S.R/Ru…
##  7    10      7                 3 Schirra, Walt… male           1923 U.S.       
##  8    13      8                 4 Cooper, L. Go… male           1927 U.S.       
##  9    15      9                 5 Bykovsky, Val… male           1934 U.S.S.R/Ru…
## 10    18     10                 6 Tereshkova, V… fema…          1937 U.S.S.R/Ru…
## # ℹ 554 more rows
## # ℹ 16 more variables: military_civilian <chr>, selection <chr>,
## #   year_of_selection <dbl>, mission_number <dbl>,
## #   total_number_of_missions <dbl>, occupation <chr>, year_of_mission <dbl>,
## #   mission_title <chr>, ascend_shuttle <chr>, in_orbit <chr>,
## #   descend_shuttle <chr>, hours_mission <dbl>, total_hrs_sum <dbl>,
## #   field21 <dbl>, eva_hrs_mission <dbl>, total_eva_hrs <dbl>
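The `duplicated()` call keeps the first row for each name and drops the later ones. A minimal sketch on a made-up data frame (the names and values are illustrative, not taken from the dataset) shows the behavior:

```r
# Toy data: one astronaut with two mission rows, one with a single row.
# These values are illustrative only.
toy <- data.frame(
  name = c("A, Alice", "A, Alice", "B, Bob"),
  mission_number = c(1, 2, 1),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each name,
# so negating it keeps only the first row per astronaut.
toy_unique <- toy[!duplicated(toy$name), ]
toy_unique
```

Which row is kept should not matter for the age calculation, assuming year_of_birth and year_of_selection are the same on every row for a given astronaut.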

Creating a new age column

astro_unique$age <- astro_unique$year_of_selection - astro_unique$year_of_birth
astro_unique

## # A tibble: 564 × 24
##       id number nationwide_number name           sex   year_of_birth nationality
##    <dbl>  <dbl>             <dbl> <chr>          <chr>         <dbl> <chr>      
##  1     1      1                 1 Gagarin, Yuri  male           1934 U.S.S.R/Ru…
##  2     2      2                 2 Titov, Gherman male           1935 U.S.S.R/Ru…
##  3     3      3                 1 Glenn, John H… male           1921 U.S.       
##  4     5      4                 2 Carpenter, M.… male           1925 U.S.       
##  5     6      5                 2 Nikolayev, An… male           1929 U.S.S.R/Ru…
##  6     8      6                 4 Popovich, Pav… male           1930 U.S.S.R/Ru…
##  7    10      7                 3 Schirra, Walt… male           1923 U.S.       
##  8    13      8                 4 Cooper, L. Go… male           1927 U.S.       
##  9    15      9                 5 Bykovsky, Val… male           1934 U.S.S.R/Ru…
## 10    18     10                 6 Tereshkova, V… fema…          1937 U.S.S.R/Ru…
## # ℹ 554 more rows
## # ℹ 17 more variables: military_civilian <chr>, selection <chr>,
## #   year_of_selection <dbl>, mission_number <dbl>,
## #   total_number_of_missions <dbl>, occupation <chr>, year_of_mission <dbl>,
## #   mission_title <chr>, ascend_shuttle <chr>, in_orbit <chr>,
## #   descend_shuttle <chr>, hours_mission <dbl>, total_hrs_sum <dbl>,
## #   field21 <dbl>, eva_hrs_mission <dbl>, total_eva_hrs <dbl>, age <dbl>

Visualization of the ages in the dataset

plot(astro_unique$age, type = "h", col = "blue", xlab = "Index", ylab = "Age", main = "Age at Year of Selection")

I am going to decide the following:

A meaningful difference in mean age is 2 years.
\(\alpha = 0.05\): I am willing to accept a 5% risk of concluding that there is a difference in ages when there actually isn’t one.
Power = 0.80 (80%): I want an 80% chance of detecting a true difference if one exists. This is a common threshold for general hypothesis testing.

astro_unique |>
  group_by(sex) |>
  summarize(sd = sd(age),
            mean = mean(age))

## # A tibble: 2 × 3
##   sex       sd  mean
##   <chr>  <dbl> <dbl>
## 1 female  3.74  32.5
## 2 male    5.24  34.6

These standard deviations (3.74 and 5.24) are of similar magnitude, so as a simplification we use the standard deviation of age across the whole dataset to approximate the common value.
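One quick way to judge how close the two spreads are is the ratio of the group variances. The sketch below uses only the two standard deviations printed in the summary table; the commented `var.test()` call is a suggestion for a formal check and assumes `sex` has exactly two levels in the data.

```r
# Standard deviations reported in the summary table
sd_female <- 3.74
sd_male   <- 5.24

# Ratio of the larger variance to the smaller one
var_ratio <- sd_male^2 / sd_female^2
round(var_ratio, 2)

# A formal check on the real data could use the F test from base R:
# var.test(age ~ sex, data = astro_unique)
```

The male variance is roughly twice the female variance, so the equal-spread assumption is only approximate. Welch's t-test, which `t.test()` runs by default, does not require equal variances.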

Kappa is the ratio between the two sample sizes (n1/n2); here we assume the groups are the same size, so kappa = 1.

test <- pwrss.t.2means(mu1 = 2, 
                       sd1 = sd(pluck(astro_unique, "age")),
                       kappa = 1,
                       power = .80, alpha = 0.05, 
                       alternative = "not equal")

##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 105 
##   n2 = 105 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 208 
##  Non-centrality parameter = 2.822 
##  Type I error rate = 0.05 
##  Type II error rate = 0.2

plot(test)

## Warning in qt(1 - prob.extreme, df = df, ncp = ncp, lower.tail = TRUE): full
## precision may not have been achieved in 'pnt{final}'

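Non-centrality parameter = 2.822: This is the standardized difference between the means (the 2-year difference divided by the standard deviation) scaled by the sample size. It measures how far the distribution of the test statistic shifts away from zero when the alternative hypothesis is true.

Sample size - n1 = 105, n2 = 105: Each group needs at least 105 astronauts to reach 80% power.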
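The reported numbers can be roughly reproduced by hand. In the sketch below, `sd_age` is an assumed value (about 5.13) back-calculated from the reported non-centrality parameter, not a figure printed in this report; on the real data it would be `sd(astro_unique$age)`.

```r
# Planning inputs from the text: a 2-year difference, alpha = 0.05, power = 0.80.
# sd_age is an ASSUMED value, back-calculated from the reported
# non-centrality parameter; replace it with sd(astro_unique$age).
delta  <- 2
sd_age <- 5.13
n      <- 105

# Standardized effect size (Cohen's d) and the non-centrality parameter
# for two equal groups of size n: ncp = d * sqrt(n / 2)
d   <- delta / sd_age
ncp <- d * sqrt(n / 2)
round(ncp, 2)

# Cross-check of the per-group sample size with base R.
# stats:: is spelled out because pwrss masks power.t.test().
check <- stats::power.t.test(delta = delta, sd = sd_age,
                             sig.level = 0.05, power = 0.80,
                             type = "two.sample",
                             alternative = "two.sided")
ceiling(check$n)
```

Both values should land close to the pwrss output, which suggests the two functions solve the same planning problem.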

Now, to test our hypothesis, we can use a t-test

Two-sided test for finding the difference of means:

astro_unique <- astro_unique |>
  filter(sex %in% c("male", "female")) |>
  filter(!is.na(age) & !is.na(sex))
my_test <- t.test(age ~ sex, data = astro_unique,
                  alternative = "two.sided")

my_test

## 
##  Welch Two Sample t-test
## 
## data:  age by sex
## t = -3.9678, df = 97.886, p-value = 0.0001383
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -3.114323 -1.037677
## sample estimates:
## mean in group female   mean in group male 
##               32.500               34.576

Note that

p-value = 0.0001383

which is far below \(\alpha = 0.05\), so the difference is statistically significant: mean(male) != mean(female), i.e. there is a difference between the mean ages.

The 95% confidence interval for the difference in means (female minus male), -3.11 to -1.04, does not include 0, so a difference of zero is not plausible given the data. The estimated difference is about 2 years, with female astronauts younger on average. This is a reason to reject our null hypothesis that there is no difference between the mean ages of female and male astronauts
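The p-value says the difference is unlikely to be zero, but not how large it is relative to the spread of ages. A standardized effect size such as Cohen's d fills that gap. The sketch below defines a small hypothetical helper, `cohens_d()`, and applies it to simulated ages; the group sizes are made up and the means and standard deviations are borrowed from the summary table purely for illustration.

```r
# Cohen's d with a pooled standard deviation for two independent samples
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Simulated ages for illustration only; group sizes are hypothetical
set.seed(1)
female_age <- rnorm(60,  mean = 32.5, sd = 3.74)
male_age   <- rnorm(500, mean = 34.6, sd = 5.24)

d_sim <- cohens_d(female_age, male_age)
d_sim

# On the real data, the already-loaded effsize package could be used instead,
# e.g. effsize::cohen.d(age ~ factor(sex), data = astro_unique)
```

A negative value means the first group (female) is younger on average, matching the sign of the t statistic.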

Let’s visualize the mean ages of female and male astronauts

mean_age <- astro_unique |>
  group_by(sex) |>
  summarize(mean_age = mean(age))

ggplot(mean_age, aes(x = sex, y = mean_age, fill = sex)) +
  geom_col(width = 0.5) +
  labs(title = "Mean Age by Sex", x = "Gender", y = "Mean Age") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("steelblue", "pink"))

The bar chart shows that the mean ages of female and male astronauts differ slightly (32.5 versus 34.6 years). Combined with the t-test result, we reject the null hypothesis in favor of the alternative: there is a difference in the mean age of male and female astronauts in their year of selection

Hypothesis 2

Null Hypothesis (H0): There is no association between the gender of the astronaut and whether they have completed a mission.
Alternative Hypothesis (H1): There is an association between the gender of the astronaut and mission completion.

data <- astro |>
  mutate(mission_completed = ifelse(hours_mission > 0, "Yes", "No"))

contingency_table <- table(data$sex, data$mission_completed)

fisher_test_result <- fisher.test(contingency_table)


fisher_test_result

## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.5103
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   0.03338758 14.34889202
## sample estimates:
## odds ratio 
##   1.589441

contingency_table

##         
##            No  Yes
##   female    1  142
##   male      5 1129

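The contingency table summarizes the counts by gender and mission completion status. Because it is built from astro, which has 1277 rows (one per astronaut-mission record, versus 564 unique astronauts), the counts are mission records rather than unique astronauts.

Female: 1 record for a female astronaut counts as not completed (hours_mission is not greater than 0), while 142 records count as completed.
Male: 5 records for male astronauts count as not completed, while 1129 records count as completed.
p-value = 0.5103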
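With so few “No” records, it is worth checking why an exact test is a reasonable choice here. A common rule of thumb is that the chi-squared approximation becomes unreliable when any expected cell count falls below 5. The sketch below rebuilds the table from the printed counts and computes the expected counts under independence.

```r
# Counts copied from the printed contingency table
tab <- matrix(c(1, 142,
                5, 1129),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("female", "male"),
                              mission_completed = c("No", "Yes")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
round(expected, 2)

# Number of cells with an expected count below 5
sum(expected < 5)
```

The expected count for the female/“No” cell is well under 1, which supports using Fisher's exact test rather than a chi-squared test.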

The p-value is the probability of observing data at least as extreme as ours if the null hypothesis is true. A p-value of 0.5103 suggests that there is not enough evidence to reject the null hypothesis.

Since 0.5103 is much greater than \(\alpha = 0.05\), we fail to reject the null hypothesis. This means we do not have sufficient evidence to conclude that there is an association between the gender of the astronaut and mission completion status.
Confidence interval - This interval suggests that the true odds ratio could be as low as 0.033 or as high as 14.35. Since it includes 1 (which represents no association), it is consistent with the conclusion that there is no significant difference in the odds of mission completion between females and males.

Based on the results of Fisher’s exact test, we fail to reject the null hypothesis. This means we found no statistically significant association between the gender of astronauts and their mission completion status. The data do not provide sufficient evidence to conclude that gender influences whether astronauts complete missions. With the table laid out as above, the odds ratio of 1.59 compares the odds of “No” for females with the odds of “No” for males, so taken at face value it points to slightly higher odds of a non-completed record for female astronauts. With only 6 “No” records in total and such a wide confidence interval, however, this estimate is not statistically significant and says little about the direction of any effect.
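The direction of an odds ratio depends on how the table is laid out, so it helps to compute it by hand. The sketch below rebuilds the table from the printed counts; `fisher.test()` reports a conditional maximum likelihood estimate, which is close to, but not identical to, the simple cross-product ratio.

```r
# Counts copied from the printed contingency table
tab <- matrix(c(1, 142,
                5, 1129),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("female", "male"),
                              mission_completed = c("No", "Yes")))

# Odds of a "No" record within each group
odds_no_female <- tab["female", "No"] / tab["female", "Yes"]
odds_no_male   <- tab["male", "No"]   / tab["male", "Yes"]

# Cross-product odds ratio: odds of "No" for females relative to males
or_no <- odds_no_female / odds_no_male
round(or_no, 2)

# Odds of "Yes" for females relative to males is the reciprocal
round(1 / or_no, 2)

# Conditional maximum likelihood estimate and p-value from Fisher's exact test
res <- fisher.test(tab)
res$estimate
res$p.value
```

An odds ratio above 1 for “No” corresponds to an odds ratio below 1 for “Yes”, so the point estimate does not favor female astronauts on completion.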

library(ggplot2)

ggplot(data, aes(x = sex, fill = mission_completed)) +
  geom_bar(position = "dodge") +
  labs(title = "Mission Completion by Gender", x = "Gender", y = "Count of Astronauts") +
  theme_minimal()

This graph is consistent with the test result: for both female and male astronauts, nearly all records are “Yes”, and the “No” bars are barely visible.

Based on Fisher’s exact test, we fail to reject the null hypothesis: there is no statistically significant association between gender and the mission completion variable.

Sneha

2024-10-18
