library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
astro <- read_delim('/Users/sneha/H510-Statistics/astronaut-data.csv')
## Rows: 1277 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, nationality, military_civilian, selection, occupation, ...
## dbl (13): id, number, nationwide_number, year_of_birth, year_of_selection, m...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We convert the “military_civilian” column to a binary variable: 1 for military, 0 for civilian. I chose military_civilian because the column contains only “military” or “civilian”, so it maps directly onto a binary response.

astro$military_binary <- ifelse(astro$military_civilian == "military", 1, 0)
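
After a recode like this, it can be worth checking how missing or unexpected labels are handled. The sketch below uses a small made-up vector (hypothetical values, not the astronaut data) to show that ifelse() keeps NA as NA and maps any label that is not exactly "military" to 0.

```r
# Hypothetical toy values, not taken from the astronaut data
status <- c("military", "civilian", NA, "Military")

# Same recode as above: exact, case-sensitive match on "military"
status_binary <- ifelse(status == "military", 1, 0)

# Cross-tabulate original labels against recoded values, keeping NA visible
print(table(status, status_binary, useNA = "ifany"))
```

Running the same table() call on astro$military_civilian and astro$military_binary would be one way to confirm that no rows were silently recoded to 0.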

Building a logistic regression model with up to four explanatory variables.

Using “military_binary” as the response variable and “sex”, “year_of_birth”, “total_number_of_missions”, and “hours_mission” as explanatory variables, we fit a logistic regression model.

logistic_model <- glm(military_binary ~ sex + year_of_birth + total_number_of_missions + hours_mission, data = astro, family = binomial)

Displaying summary

summary(logistic_model)
## 
## Call:
## glm(formula = military_binary ~ sex + year_of_birth + total_number_of_missions + 
##     hours_mission, family = binomial, data = astro)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               7.854e+00  1.099e+01   0.715    0.475    
## sexmale                   1.623e+00  2.049e-01   7.924  2.3e-15 ***
## year_of_birth            -4.501e-03  5.615e-03  -0.802    0.423    
## total_number_of_missions -3.050e-02  4.221e-02  -0.723    0.470    
## hours_mission            -6.174e-06  3.661e-05  -0.169    0.866    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1716.6  on 1276  degrees of freedom
## Residual deviance: 1636.8  on 1272  degrees of freedom
## AIC: 1646.8
## 
## Number of Fisher Scoring iterations: 4

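The intercept’s estimate is 7.854, which represents the log-odds of the outcome being military when all predictors are zero, that is, for a female astronaut with year_of_birth, total_number_of_missions, and hours_mission all equal to 0. A birth year of 0 lies far outside the data, so this value has no practical interpretation on its own. Its p-value (0.475) only says that the intercept is not significantly different from zero; it does not measure how much the intercept contributes to prediction.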
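
One common way to make an intercept easier to read is to center continuous predictors before fitting. The sketch below illustrates this on simulated data; the data frame `sim` and every column in it are made up for illustration and are not the astronaut data. Centering changes what the intercept refers to (an average birth year instead of year 0) while leaving the slopes unchanged.

```r
set.seed(1)  # simulated data for illustration only

n <- 500
year <- sample(1930:1980, n, replace = TRUE)
male <- rbinom(n, 1, 0.8)
y <- rbinom(n, 1, plogis(-1 + 1.5 * male))  # outcome depends on sex only

# year_c is the birth year measured from the sample mean
sim <- data.frame(y, male, year, year_c = year - mean(year))

fit_raw      <- glm(y ~ male + year,   data = sim, family = binomial)
fit_centered <- glm(y ~ male + year_c, data = sim, family = binomial)

# The slopes agree between the two fits; only the intercept differs
print(round(rbind(raw = coef(fit_raw), centered = coef(fit_centered)), 4))
```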

The coefficient for sexmale is 1.623 with a p-value of 2.3e-15, which is highly significant. This positive coefficient means that, holding the other predictors fixed, being male significantly increases the log-odds of the outcome (being military) compared to being female.

Hence, males are more likely to be in the military category than females. This is an interesting insight!
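
A log-odds coefficient is often easier to read as an odds ratio. As a small sketch, the value below is typed in from the summary output rather than pulled from the fitted object, so the snippet runs on its own.

```r
# Estimate for sexmale as reported in the summary output (log-odds scale)
beta_sexmale <- 1.623

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio <- exp(beta_sexmale)
print(round(odds_ratio, 2))
```

So the estimated odds of being military are about five times higher for males than for females, other predictors held fixed. With the fitted object in memory, exp(coef(logistic_model)) would apply the same transformation to every coefficient.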

The coefficient for year_of_birth is -0.0045, and it is not significant (p-value of 0.423). The model therefore gives no evidence that year_of_birth is associated with being military rather than civilian once the other predictors are accounted for.

The coefficient for total_number_of_missions is -0.0305, and it is also not significant (p-value of 0.470). This suggests that the total number of missions does not significantly influence the likelihood of being in the military or civilian group.

The coefficient for hours_mission is -6.174e-06, and it is also not statistically significant (p-value of 0.866). This indicates that the hours spent on missions do not significantly affect the probability of being military.

Null deviance (1716.6) and residual deviance (1636.8): The deviance drops by 79.8 for four added parameters, so the model explains some variability in the data. The improvement is modest, though, and is likely driven mainly by sex, since the other predictors are not significant.
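
The drop in deviance can be turned into a likelihood-ratio test of the full model against the intercept-only model. The sketch below uses the deviances and degrees of freedom typed in from the summary output and compares the difference to a chi-squared distribution.

```r
# Values as reported in the summary output
null_dev  <- 1716.6; null_df  <- 1276
resid_dev <- 1636.8; resid_df <- 1272

# Likelihood-ratio statistic and its degrees of freedom
lr_stat <- null_dev - resid_dev   # drop in deviance
lr_df   <- null_df - resid_df     # number of added parameters

# Upper-tail chi-squared probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value))
```

The very small p-value says the four predictors together improve on the null model, even though only one of them is individually significant. On the fitted object, anova(logistic_model, test = "Chisq") would give a term-by-term version of this comparison.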

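AIC (1646.8): A lower AIC is generally preferable, but the value is only informative when compared with the AIC of other models fitted to the same data.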
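
The AIC reported in the summary (1646.8) can be reproduced by hand. For a 0/1 response the residual deviance equals minus twice the log-likelihood, so the AIC is the residual deviance plus twice the number of estimated coefficients. A minimal sketch using the numbers from the summary output:

```r
# Values as reported in the summary output
resid_dev <- 1636.8
n_coef <- 5   # intercept plus four slopes

# AIC = -2 * logLik + 2 * k; for binary data -2 * logLik is the residual deviance
aic <- resid_dev + 2 * n_coef
print(aic)
```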

Insights from the above model:

The only statistically significant predictor in the model is sex: being male is associated with a higher likelihood of being classified as military. The other variables (year of birth, total missions, hours on mission) do not have a significant effect. This suggests that the model distinguishes military from civilian status mainly through sex, and that the current set of predictors may not capture the relationship well.
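
To see what the coefficients imply on the probability scale, the linear predictor can be passed through the inverse logit. The sketch below uses the estimates typed in from the summary output and a hypothetical profile (born in 1960, 2 missions, 500 mission hours) chosen only for illustration.

```r
# Coefficients as reported in the summary output
b <- c(intercept = 7.854, sexmale = 1.623, year_of_birth = -4.501e-03,
       total_number_of_missions = -3.050e-02, hours_mission = -6.174e-06)

# Hypothetical profiles, in the same order as the coefficients
profile_male   <- c(1, 1, 1960, 2, 500)
profile_female <- c(1, 0, 1960, 2, 500)

# plogis() is the inverse logit: log-odds -> probability
p_male   <- plogis(sum(b * profile_male))
p_female <- plogis(sum(b * profile_female))
print(round(c(male = p_male, female = p_female), 3))
```

With the fitted object available, predict(logistic_model, newdata = ..., type = "response") would return the same kind of probabilities without retyping the coefficients.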

Using the Standard Error for the coefficient of “total_number_of_missions” to create a 95% CI

coeff_estimate <- coef(logistic_model)["total_number_of_missions"]
std_error <- summary(logistic_model)$coefficients["total_number_of_missions", "Std. Error"]
conf_interval <- coeff_estimate + c(-1.96, 1.96) * std_error
conf_interval
## [1] -0.11322679  0.05223369

Here, c(-1.96, 1.96) holds the critical values for a 95% confidence interval under a standard normal distribution, so this is a Wald interval on the log-odds scale.

Since the confidence interval includes zero, the data are consistent with total_number_of_missions having no effect on the probability of being in the military, which agrees with its non-significant p-value (0.470).
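
The same interval can be computed without hard-coding 1.96 by asking qnorm() for the critical value. The helper `wald_ci` below is a hypothetical name introduced for this sketch, and the estimate and standard error are typed in from the summary output.

```r
# Hypothetical helper: Wald interval for a single coefficient
wald_ci <- function(estimate, std_error, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.96 when level = 0.95
  estimate + c(-1, 1) * z * std_error
}

# total_number_of_missions: estimate and standard error from the summary
ci_missions <- wald_ci(-3.050e-02, 4.221e-02)
print(ci_missions)
```

On the fitted object, confint.default(logistic_model) returns Wald intervals of this kind for all coefficients at once.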

Using the Standard Error for the coefficient of “sex” to create a 95% CI

coeff_estimate <- coef(logistic_model)["sexmale"]
std_error <- summary(logistic_model)$coefficients["sexmale", "Std. Error"]
conf_interval <- coeff_estimate + c(-1.96, 1.96) * std_error
conf_interval
## [1] 1.221844 2.024958

Since the entire interval is above zero, we can say that the effect of being male (relative to being female) on the log-odds of being in the military is positive. This suggests that sex (specifically being male) is significantly associated with an increased likelihood of being in the military.
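
The interval above is on the log-odds scale. Exponentiating its endpoints gives a 95% interval for the odds ratio, where the no-effect value is 1 rather than 0. A small sketch using the endpoints printed above:

```r
# Interval for sexmale on the log-odds scale, as printed above
ci_log_odds <- c(lower = 1.221844, upper = 2.024958)

# Exponentiate the endpoints to move to the odds-ratio scale
ci_odds_ratio <- exp(ci_log_odds)
print(round(ci_odds_ratio, 2))
```

Both endpoints are well above 1, roughly 3.4 to 7.6, which matches the conclusion drawn from the log-odds interval.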