The response variable is hours_mission, the number of hours an astronaut spent on a specific mission. The explanatory variable is the categorical column “nationality”, which lets us test whether mean mission duration differs across astronauts’ nationalities.

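Null Hypothesis for ANOVA Test: the mean hours_mission is the same for every nationality. The alternative hypothesis is that the mean differs for at least one nationality.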
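
Written formally, the hypotheses can be sketched as follows. The symbol μ_i is notation assumed here (it does not appear in the original analysis) for the mean hours_mission of astronauts from nationality i, and the index runs over the 40 nationality groups implied by the ANOVA output further down.

```latex
H_0:\ \mu_1 = \mu_2 = \cdots = \mu_{40}
\qquad\text{versus}\qquad
H_a:\ \mu_i \neq \mu_j \ \text{for at least one pair } (i, j)
```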

ANOVA test using R:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
astro <- read_delim('/Users/sneha/H510-Statistics/astronaut-data.csv')
## Rows: 1277 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, nationality, military_civilian, selection, occupation, ...
## dbl (13): id, number, nationwide_number, year_of_birth, year_of_selection, m...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
astro_data <- astro |> select(hours_mission, nationality)

Removing NA values

astro_data <- na.omit(astro_data)
astro_data$nationality <- as.factor(astro_data$nationality)

Performing ANOVA test:

anova <- aov(hours_mission ~ nationality, data = astro_data)
summary(anova)
##               Df    Sum Sq  Mean Sq F value Pr(>F)    
## nationality   39 8.336e+08 21375502    9.06 <2e-16 ***
## Residuals   1237 2.918e+09  2359295                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Result analysis:

nationality: 39 degrees of freedom, which means there are 40 different nationalities in the dataset.

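Residuals: 1237 degrees of freedom, which is the number of observations minus the number of groups (1277 - 40 = 1237). These are the degrees of freedom for the within-group variation that remains after accounting for nationality.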

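8.336e+08 is the sum of squares for nationality: the between-group variation in hours_mission, which is the part that differences between nationality means can account for. The residual sum of squares, 2.918e+09, is the within-group variation that nationality does not explain.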
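
To put the sum of squares in perspective, it can help to express it as a share of the total variation. The sketch below uses only the two Sum Sq values printed in the table above; the variable names are illustrative and the result is approximate because the printed values are rounded.

```r
# Sum Sq values copied from the summary(anova) table (rounded as printed)
ss_nationality <- 8.336e+08
ss_residuals   <- 2.918e+09

# Share of total variation in hours_mission attributable to nationality
share_explained <- ss_nationality / (ss_nationality + ss_residuals)
round(share_explained, 3)
```

Under these rounded inputs the share comes out at roughly 22%, so most of the variation in mission hours lies within nationalities rather than between them.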

The F-value is the ratio of the mean square for nationality to the mean square of the residuals. Here it is 9.06, so the between-group mean square is about nine times the within-group mean square. Under the null hypothesis this ratio would be expected to be close to 1.

The p-value is reported as < 2e-16, which is extremely small (essentially zero), meaning the result is highly significant.
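
The degrees of freedom, F-value, and p-value can be reproduced by hand from the numbers in the table. This is a minimal sketch that hard-codes the printed values instead of reading them from the fitted object; the variable names are illustrative.

```r
# Values copied from the data import message and the summary(anova) table
n_obs          <- 1277      # rows in the dataset
n_groups       <- 40        # distinct nationalities
ms_nationality <- 21375502  # Mean Sq for nationality
ms_residuals   <- 2359295   # Mean Sq for residuals

df_between <- n_groups - 1       # 39
df_within  <- n_obs - n_groups   # 1237

# F is the ratio of the two mean squares
f_value <- ms_nationality / ms_residuals
round(f_value, 2)

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)
p_value
```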

Since the p-value is far below the common threshold of 0.05, we reject the null hypothesis. There is strong evidence that mean mission hours differ for at least one nationality. Because the data are observational, this shows an association between nationality and mission hours, not that nationality itself causes the difference.

Readers may infer from this result that mission duration is influenced by nationality. Plausible practical explanations include differences in training, mission types, or space programs between countries. However, further investigation would be needed to understand the specific reasons for the differences.
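
One possible direction for that further investigation is a pairwise comparison of group means, since the ANOVA alone does not say which nationalities differ. The sketch below uses base R's TukeyHSD() on a small made-up data frame; the `toy` data, its group labels A, B, and C, and its values are hypothetical and are not taken from the astronaut dataset.

```r
# Hypothetical toy data: three groups, three observations each
toy <- data.frame(
  nationality   = factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C")),
  hours_mission = c(100, 120, 110, 300, 320, 310, 105, 115, 125)
)

# Same model form as the analysis above, fitted to the toy data
toy_fit <- aov(hours_mission ~ nationality, data = toy)

# Tukey's honest significant differences for every pair of groups
toy_pairs <- TukeyHSD(toy_fit)

# Matrix with columns diff, lwr, upr, and p adj; one row per pair
toy_pairs$nationality
```

With 40 nationalities the real data would produce 780 pairwise comparisons, so the output would need to be filtered or summarised before interpretation.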

Linear Regression:

We selected year_of_selection as another continuous variable and built a linear regression model to predict hours_mission.

new_data <- astro |> select(hours_mission, nationality, year_of_selection)
lm_model <- lm(hours_mission ~ year_of_selection, data = new_data)
summary(lm_model)
## 
## Call:
## lm(formula = hours_mission ~ year_of_selection, data = new_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2107.1  -991.7  -668.3     9.0  9976.2 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -75234.45    7505.77  -10.02   <2e-16 ***
## year_of_selection     38.42       3.78   10.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1650 on 1275 degrees of freedom
## Multiple R-squared:  0.07495,    Adjusted R-squared:  0.07422 
## F-statistic: 103.3 on 1 and 1275 DF,  p-value: < 2.2e-16

The coefficients of the model will help us understand how much hours_mission changes with each additional year.

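Residuals: the residuals range from -2107.1 to 9976.2, meaning there are some large deviations between the actual and predicted mission hours for certain observations.

The median residual is -668.3. A residual is the actual value minus the predicted value, so a negative median means that for the typical observation the model overestimates mission hours. The mean residual of a least-squares fit with an intercept is zero, so a negative median combined with a large maximum (9976.2) suggests a right-skewed distribution, in which a relatively small number of very long missions pull the fitted line upward.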
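
The behaviour of the mean and median residual can be illustrated with a small example. The data below are made up for illustration (the `toy` data frame is hypothetical and unrelated to the astronaut dataset): six points lie on a straight line and one point is far above it.

```r
# Hypothetical toy data: one very large y value among otherwise small ones
toy <- data.frame(x = 1:7, y = c(1, 2, 3, 40, 5, 6, 7))

toy_fit <- lm(y ~ x, data = toy)
res <- residuals(toy_fit)   # residual = actual - fitted

mean(res)    # zero up to floating-point error
median(res)  # negative: most points sit below the fitted line
```

The single large value raises the fitted line, so the other six points all end up below it. The mean residual is still zero, but the median is negative.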

Intercept (-75234.45):
This value represents the estimated hours_mission when the year_of_selection is zero. In practical terms, this value does not have a meaningful real-world interpretation because astronauts were not selected in year 0. The intercept mainly serves as a baseline for the model.

year_of_selection (38.42):
This coefficient suggests that each one-year increase in year_of_selection is associated with an average increase of approximately 38.42 mission hours. This is a positive and statistically significant relationship, indicating that astronauts selected in more recent years tend to have longer mission hours.
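
To make the two coefficients concrete, the fitted line can be evaluated by hand. This sketch hard-codes the estimates printed above; `predict_hours` is a hypothetical helper name, and the year 2000 is chosen only as an illustrative input.

```r
# Estimates copied from the coefficient table
intercept <- -75234.45
slope     <- 38.42

# Hypothetical helper: predicted hours_mission for a given selection year
predict_hours <- function(year) intercept + slope * year

predict_hours(2000)                        # about 1605.55 hours
predict_hours(2001) - predict_hours(2000)  # the slope, about 38.42 hours
```

With the fitted model object available, predict(lm_model, newdata = data.frame(year_of_selection = 2000)) should give nearly the same value, differing only by the rounding of the printed coefficients.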

Both the intercept (p<0.001) and the slope for year_of_selection (p<0.001) are statistically significant.

The p-value for year_of_selection is extremely small (reported as < 2e-16), meaning that there is strong evidence that year_of_selection is related to mission hours.

The multiple R-squared value is 0.07495 (adjusted R-squared: 0.07422), which means that only about 7.5% of the variance in mission hours can be explained by the year of astronaut selection. This is a low value: although the relationship between year_of_selection and hours_mission is statistically significant, it accounts for only a small portion of the variation in mission hours.
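
The adjusted R-squared and the F-statistic in the output both follow from the multiple R-squared and the sample size. The sketch below reproduces them from the printed values using the standard formulas; the variable names are illustrative.

```r
# Values copied from the model summary and the data import message
r_squared    <- 0.07495
n_obs        <- 1277
n_predictors <- 1

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
round(adj_r_squared, 5)

# Overall F-statistic expressed through R-squared
f_stat <- (r_squared / n_predictors) /
  ((1 - r_squared) / (n_obs - n_predictors - 1))
round(f_stat, 1)
```

With a single predictor and 1277 observations the adjustment is tiny, which is why the two R-squared values agree to three decimal places.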

The model indicates that astronauts selected in more recent years tend to have longer mission hours. For each additional year, there is an average increase of 38.42 mission hours. However, since the R-squared value is quite low, this variable alone is not a strong predictor of mission duration. Other factors (such as mission type, astronaut experience, or nationality) likely play a larger role in determining how long astronauts spend in space.

Recommendations:

While year_of_selection is significant, this model does not capture enough of the variation in mission hours to be considered highly predictive. It would be beneficial to include other variables (like mission type, nationality, or rank) in a regression model to improve the predictive power.