I found this week’s dataset about astronauts and space walks on Kaggle. I want to see how the number of space walks an astronaut has completed is related to, and affected by, their gender, number of space flight hours logged, status, and military rank. I will be specifically analyzing the number of space flight hours logged and gender.

Loading in the data and packages:

library(ggplot2)
library(texreg)
## Version:  1.36.23
## Date:     2017-03-03
## Author:   Philip Leifeld (University of Glasgow)
## 
## Please cite the JSS article in your publications -- see citation("texreg").
library(Zelig)
## Loading required package: survival
## 
## Attaching package: 'Zelig'
## The following object is masked from 'package:ggplot2':
## 
##     stat
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:texreg':
## 
##     extract
library(readr)
astronauts_1_ <- read_csv("C:/Users/abbys/Downloads/astronauts (1).csv")
## Parsed with column specification:
## cols(
##   Name = col_character(),
##   Year = col_double(),
##   Group = col_double(),
##   Status = col_character(),
##   `Birth Date` = col_character(),
##   `Birth Place` = col_character(),
##   Gender = col_character(),
##   `Alma Mater` = col_character(),
##   `Undergraduate Major` = col_character(),
##   `Graduate Major` = col_character(),
##   `Military Rank` = col_character(),
##   `Military Branch` = col_character(),
##   `Space Flights` = col_double(),
##   `Space Flight (hr)` = col_double(),
##   `Space Walks` = col_double(),
##   `Space Walks (hr)` = col_double(),
##   Missions = col_character(),
##   `Death Date` = col_character(),
##   `Death Mission` = col_character()
## )
head(astronauts_1_)
## # A tibble: 6 x 19
##   Name   Year Group Status `Birth Date` `Birth Place` Gender `Alma Mater`
##   <chr> <dbl> <dbl> <chr>  <chr>        <chr>         <chr>  <chr>       
## 1 Jose~  2004    19 Active 5/17/1967    Inglewood, CA Male   University ~
## 2 Lore~    NA    NA Retir~ 3/7/1936     Lewiston, MT  Male   Montana Sta~
## 3 Jame~  1984    10 Retir~ 3/3/1946     Warsaw, NY    Male   US Military~
## 4 Thom~  1987    12 Retir~ 5/20/1951    St. Louis, MO Male   University ~
## 5 Buzz~  1963     3 Retir~ 1/20/1930    Montclair, NJ Male   US Military~
## 6 Andr~  1987    12 Retir~ 8/4/1955     Philadelphia~ Male   Villanova U~
## # ... with 11 more variables: `Undergraduate Major` <chr>, `Graduate
## #   Major` <chr>, `Military Rank` <chr>, `Military Branch` <chr>, `Space
## #   Flights` <dbl>, `Space Flight (hr)` <dbl>, `Space Walks` <dbl>, `Space
## #   Walks (hr)` <dbl>, Missions <chr>, `Death Date` <chr>, `Death
## #   Mission` <chr>

Cleaning up the data:

astronauts1<-astronauts_1_ %>%
  filter(!is.na(`Space Flight (hr)`),
          !is.na(`Space Flights`),
          !is.na(`Space Walks`),
          !is.na(Year),
         !is.na(Gender),
         !is.na(Status),
         !is.na(`Military Rank`)
          )%>%
  rename(Space_Walks=`Space Walks`,
         Space_Flights=`Space Flights`,
         Space_Flight_hr=`Space Flight (hr)`,
         Space_Walks_hr=`Space Walks (hr)`,
         Military_Rank=`Military Rank`)%>%
  mutate(Gender=factor(Gender),Status = factor(Status),Military_Rank=factor(Military_Rank))

dim(astronauts1)
## [1] 330  19
head(astronauts1)
## # A tibble: 6 x 19
##   Name   Year Group Status `Birth Date` `Birth Place` Gender `Alma Mater`
##   <chr> <dbl> <dbl> <fct>  <chr>        <chr>         <fct>  <chr>       
## 1 Jose~  2004    19 Active 5/17/1967    Inglewood, CA Male   University ~
## 2 Jame~  1984    10 Retir~ 3/3/1946     Warsaw, NY    Male   US Military~
## 3 Thom~  1987    12 Retir~ 5/20/1951    St. Louis, MO Male   University ~
## 4 Buzz~  1963     3 Retir~ 1/20/1930    Montclair, NJ Male   US Military~
## 5 Andr~  1987    12 Retir~ 8/4/1955     Philadelphia~ Male   Villanova U~
## 6 Jose~  1967     6 Retir~ 6/27/1937    Crawsfordsvi~ Male   DePauw Univ~
## # ... with 11 more variables: `Undergraduate Major` <chr>, `Graduate
## #   Major` <chr>, Military_Rank <fct>, `Military Branch` <chr>,
## #   Space_Flights <dbl>, Space_Flight_hr <dbl>, Space_Walks <dbl>,
## #   Space_Walks_hr <dbl>, Missions <chr>, `Death Date` <chr>, `Death
## #   Mission` <chr>

I will be using a Poisson model because my dependent variable, Space_Walks, is count data with values of 0 or greater.
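
As a reminder of what this choice assumes, a Poisson regression can be written as below. The symbols (Y_i for an astronaut's number of space walks, x_i for their predictors, beta for the coefficients) are my own shorthand and do not come from the dataset or from the Zelig output.

```latex
Y_i \sim \mathrm{Poisson}(\lambda_i), \qquad
\log \lambda_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, \qquad
\mathrm{E}[Y_i \mid x_i] = \lambda_i
```

Because the predictors enter through the log of the expected count, each coefficient acts multiplicatively on the expected number of space walks.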

Choosing the best model

Model 4 has the lowest AIC and BIC of the four models, so I will be using model 4 for my analysis.

z.out1 <- zelig(Space_Walks ~ Gender, model = "poisson", data = astronauts1, cite = F)
z.out2 <- zelig(Space_Walks ~ Gender+Status, model = "poisson", data = astronauts1, cite = F)
z.out3 <- zelig(Space_Walks ~ Gender+Status+Military_Rank, model = "poisson", data = astronauts1, cite = F)
z.out4 <- zelig(Space_Walks ~ Space_Flight_hr+Gender+Status+Military_Rank, model = "poisson", data = astronauts1, cite = F)
htmlreg(list(z.out1,z.out2,z.out3,z.out4),doctype=FALSE)
Statistical models
                                    Model 1    Model 2    Model 3    Model 4
(Intercept)                         -0.47*     0.14       -0.15      -0.75
                                    (0.18)     (0.19)     (0.49)     (0.50)
GenderMale                          0.86***    1.05***    1.30***    1.24***
                                    (0.19)     (0.19)     (0.19)     (0.19)
StatusDeceased                                 -2.33***   -2.26***   -1.51***
                                               (0.28)     (0.28)     (0.31)
StatusManagement                               -0.51***   -0.54***   0.04
                                               (0.15)     (0.15)     (0.18)
StatusRetired                                  -0.94***   -0.93***   -0.34*
                                               (0.11)     (0.11)     (0.14)
Military_RankCaptain                                      0.04       -0.19
                                                          (0.46)     (0.46)
Military_RankColonel                                      -0.28      -0.39
                                                          (0.46)     (0.46)
Military_RankCommander                                    -0.46      -0.31
                                                          (0.52)     (0.52)
Military_RankLieutenant Colonel                           -0.57      -0.61
                                                          (0.63)     (0.63)
Military_RankLieutenant Commander                         -15.90     -15.95
                                                          (1089.25)  (1096.81)
Military_RankLieutenant General                           -0.46      -0.71
                                                          (1.10)     (1.10)
Military_RankMajor                                        -15.20     -15.30
                                                          (1051.68)  (1051.65)
Military_RankMajor General                                -16.63     -16.62
                                                          (1047.88)  (1045.58)
Military_RankNot In Military                              0.39       0.27
                                                          (0.46)     (0.46)
Military_RankRear Admiral                                 0.64       0.60
                                                          (0.73)     (0.73)
Military_RankVice Admiral                                 -16.53     -16.49
                                                          (2103.36)  (2103.36)
Space_Flight_hr                                                      0.00***
                                                                     (0.00)
AIC                                 1334.11    1205.18    1161.68    1101.56
BIC                                 1341.71    1224.17    1222.46    1166.15
Log Likelihood                      -665.06    -597.59    -564.84    -533.78
Deviance                            938.62     803.68     738.18     676.07
Num. obs.                           330        330        330        330
*** p < 0.001, ** p < 0.01, * p < 0.05
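
To make the AIC and BIC comparison concrete, here is a small sketch that recomputes both from the log likelihoods reported in the table. The parameter counts in `k` are my own tally of the coefficient rows for each model (an assumption, not something texreg prints), and the variable names are made up for this illustration.

```r
# Log likelihoods copied from the regression table
loglik <- c(m1 = -665.06, m2 = -597.59, m3 = -564.84, m4 = -533.78)

# Number of estimated coefficients per model (counted by hand from the table rows)
k <- c(m1 = 2, m2 = 5, m3 = 16, m4 = 17)

# Number of observations
n <- 330

# Standard definitions: AIC = 2k - 2*logLik, BIC = k*log(n) - 2*logLik
aic <- 2 * k - 2 * loglik
bic <- k * log(n) - 2 * loglik

round(rbind(AIC = aic, BIC = bic), 2)

# Which model minimises each criterion
names(which.min(aic))
names(which.min(bic))
```

The recomputed values agree with the table to within rounding, and both criteria pick model 4 even though BIC penalises its extra parameters more heavily.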

I am going to focus my analysis on Space_Flight_hr and Gender, which are among the most statistically significant variables in model 4.

Space_Flight_hr Difference in Number of Space Walks Taken

I want to see how the number of space flight hours logged relates to how many space walks astronauts have completed. I am setting three counterfactuals, each comparing Space_Flight_hr at its mean against Space_Flight_hr at the mean plus 1, 2, and 3 standard deviations.
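
Written out, the quantity each comparison estimates is a first difference in expected space walks. The notation here is my own: h is Space_Flight_hr, h-bar its sample mean, and s_h its sample standard deviation. I am assuming the other predictors stay at whatever default values setx assigns them.

```latex
\mathrm{FD}_k = \mathrm{E}\left[Y \mid h = \bar{h} + k\, s_h\right] - \mathrm{E}\left[Y \mid h = \bar{h}\right], \qquad k = 1, 2, 3
```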

Mean vs Mean+1 Standard Deviation:

Are astronauts who are 1 standard deviation above the mean for space flight hours expected to have completed more space walks than those at the mean? At the mean of space flight hours, the expected number of space walks is about 1.9; at 1 standard deviation above the mean, it is about 2.6 (more space flight hours predicts more space walks). If you are 1 standard deviation above the average, then you complete about 0.7 more space walks, on average.

z.out4 <- zelig(Space_Walks ~ Space_Flight_hr+Gender+Status+Military_Rank, model = "poisson", data = astronauts1, cite = F)
x<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr))
x1<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr)+(sd(astronauts1$Space_Flight_hr)))
s1<-sim(z.out4,x=x,x1=x1)
summary(s1)
## 
##  sim x :
##  -----
## ev
##         mean        sd      50%     2.5%    97.5%
## [1,] 1.91118 0.1566235 1.905723 1.633986 2.229349
## pv
##       mean       sd 50% 2.5% 97.5%
## [1,] 1.929 1.416321   2    0     5
## 
##  sim x1 :
##  -----
## ev
##          mean       sd      50%     2.5%    97.5%
## [1,] 2.612319 0.224598 2.601627 2.201223 3.060986
## pv
##       mean       sd 50% 2.5% 97.5%
## [1,] 2.573 1.616566   2    0     6
## fd
##           mean         sd       50%      2.5%     97.5%
## [1,] 0.7011387 0.09559223 0.6951531 0.5301676 0.9100785

Mean vs Mean+2 Standard Deviations:

At the mean of space flight hours, the expected number of space walks is about 1.9; at 2 standard deviations above the mean, it is about 3.58. If you are 2 standard deviations above the mean, then you complete about 1.67 more space walks, on average.

x<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr))
x1<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr)+(2*sd(astronauts1$Space_Flight_hr)))
s2<-sim(z.out4,x=x,x1=x1)

summary(s2)
## 
##  sim x :
##  -----
## ev
##          mean        sd     50%     2.5%    97.5%
## [1,] 1.912986 0.1592523 1.90885 1.624212 2.238659
## pv
##       mean       sd 50% 2.5% 97.5%
## [1,] 1.897 1.397263   2    0     5
## 
##  sim x1 :
##  -----
## ev
##          mean       sd      50%     2.5%    97.5%
## [1,] 3.583403 0.362065 3.567036 2.936899 4.342661
## pv
##       mean       sd 50% 2.5% 97.5%
## [1,] 3.631 1.967921   3    0     8
## fd
##          mean        sd      50%     2.5%    97.5%
## [1,] 1.670417 0.2592069 1.657284 1.221215 2.218257

Mean vs Mean+3 Standard Deviations:

At the mean of space flight hours, the expected number of space walks is about 1.9; at 3 standard deviations above the mean, it is about 4.9. If you are 3 standard deviations above the average, then you complete about 3 more space walks, on average.

x<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr))
x1<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr)+(3*sd(astronauts1$Space_Flight_hr)))
s3<-sim(z.out4,x=x,x1=x1)

summary(s3)
## 
##  sim x :
##  -----
## ev
##          mean        sd      50%     2.5%    97.5%
## [1,] 1.907846 0.1644072 1.900581 1.605457 2.252538
## pv
##       mean       sd 50% 2.5% 97.5%
## [1,] 1.964 1.345554   2    0     5
## 
##  sim x1 :
##  -----
## ev
##          mean        sd      50%     2.5%    97.5%
## [1,] 4.894271 0.5847786 4.883474 3.859309 6.109611
## pv
##       mean       sd 50% 2.5% 97.5%
## [1,] 4.917 2.281955   5    1    10
## fd
##          mean       sd      50%     2.5%    97.5%
## [1,] 2.986425 0.493559 2.964175 2.090323 4.025394
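
The three first differences (about 0.7, 1.67, and 3) grow faster than linearly even though each step adds the same one standard deviation of flight hours. A quick check using the expected values printed above suggests why: each extra standard deviation multiplies the expected count by roughly the same factor. The variable names are mine, and I take the baseline from the first simulation only, so the differences can deviate slightly from the printed fd rows.

```r
# Expected values (ev means) copied from the three summaries above
ev <- c(mean = 1.91118, sd1 = 2.612319, sd2 = 3.583403, sd3 = 4.894271)

# Ratio of each expected value to the one a standard deviation below it
step_ratio <- ev[-1] / ev[-length(ev)]
round(step_ratio, 3)

# Implied change in the log expected count per standard deviation of flight hours
round(log(step_ratio), 3)

# First differences relative to the baseline at the mean
fd <- ev[-1] - ev[["mean"]]
round(fd, 2)
```

This is also why the Space_Flight_hr coefficient can display as 0.00 and still matter: it is a small per-hour effect on the log scale that adds up over the thousands of hours in one standard deviation.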

Setting up counterfactuals and simulations for scatterplot and histogram:

xm<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr))
xsd1<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr)+sd(astronauts1$Space_Flight_hr))
sd1<-sim(z.out4,x=xm,x1=xsd1)
fsd1 <- sd1$get_qi(xvalue="x1", qi="fd")

xm<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr))
xsd2<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr)+2*sd(astronauts1$Space_Flight_hr))
sd2<-sim(z.out4,x=xm,x1=xsd2)
fsd2 <- sd2$get_qi(xvalue="x1", qi="fd")

xm<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr))
xsd3<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr)+(3*sd(astronauts1$Space_Flight_hr)))
sd3<-sim(z.out4,x=xm,x1=xsd3)
fsd3 <- sd3$get_qi(xvalue="x1", qi="fd")
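
For readers without Zelig, the idea behind these simulations can be sketched in base R plus MASS. This is my own approximation of the general approach (draw coefficients from their estimated sampling distribution, then compute expected counts at each counterfactual). It is not Zelig's actual implementation, and the `toy` data are invented for illustration and are not the astronaut dataset.

```r
library(MASS)  # provides mvrnorm; MASS ships with standard R installations

set.seed(42)

# Toy data, invented for illustration only
n <- 300
toy <- data.frame(hours = rexp(n, rate = 1 / 1500))
toy$walks <- rpois(n, lambda = exp(0.2 + 0.0002 * toy$hours))

# Poisson regression of the count on hours
fit <- glm(walks ~ hours, family = poisson, data = toy)

# Draw 1000 plausible coefficient vectors from the estimated sampling distribution
draws <- mvrnorm(1000, mu = coef(fit), Sigma = vcov(fit))

# Expected count at a given number of hours, one value per coefficient draw
expected_count <- function(h) {
  exp(draws[, "(Intercept)"] + draws[, "hours"] * h)
}

h_mean <- mean(toy$hours)
h_sd <- sd(toy$hours)

# Simulated first differences for mean + 1, 2, and 3 standard deviations
fd <- sapply(1:3, function(k) {
  expected_count(h_mean + k * h_sd) - expected_count(h_mean)
})
colnames(fd) <- paste0("mean+", 1:3, "SD")

colMeans(fd)
```

Each column of `fd` plays the same role as fsd1, fsd2, and fsd3 above: a vector of simulated first differences that can be summarised or plotted.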

Setting up Scatter plot and Histogram:

The scatter plot shows that the average first difference in space walks grows as an astronaut’s space flight hours move further above the mean, which indicates a positive relationship between space flight hours and the number of space walks completed. The histograms show the simulated first differences for each counterfactual: moving from 1 to 2 to 3 standard deviations above the mean shifts the distribution of additional expected space walks to the right and spreads it out. In short, increasing space flight hours by 1 standard deviation increases the expected number of space walks as well, on average.

d <- as.data.frame(cbind(fsd1, fsd2, fsd3))%>%
  rename("Avg + 1 SD Space_Flight_hr"=V1,
         "Avg + 2 SD_Space_flight_hr"=V2,
          "Avg + 3 SD_Space_Flight_hr"=V3)
head(d)
##   Avg + 1 SD Space_Flight_hr Avg + 2 SD_Space_flight_hr
## 1                  0.6729308                   1.690023
## 2                  1.0207631                   1.337347
## 3                  0.6766981                   1.744243
## 4                  0.6837467                   2.014837
## 5                  0.5854133                   1.986321
## 6                  0.6136207                   1.930837
##   Avg + 3 SD_Space_Flight_hr
## 1                   2.356014
## 2                   3.581960
## 3                   2.535662
## 4                   3.701167
## 5                   3.802460
## 6                   3.237260
tidd <- d %>% 
  gather(Flight_Hours, Space_Walk, 1:3)
head(tidd)
##                 Flight_Hours Space_Walk
## 1 Avg + 1 SD Space_Flight_hr  0.6729308
## 2 Avg + 1 SD Space_Flight_hr  1.0207631
## 3 Avg + 1 SD Space_Flight_hr  0.6766981
## 4 Avg + 1 SD Space_Flight_hr  0.6837467
## 5 Avg + 1 SD Space_Flight_hr  0.5854133
## 6 Avg + 1 SD Space_Flight_hr  0.6136207
tidd %>% 
  group_by(Flight_Hours) %>% 
  summarise(mean = mean(Space_Walk), sd = sd(Space_Walk))
## # A tibble: 3 x 3
##   Flight_Hours                mean     sd
##   <chr>                      <dbl>  <dbl>
## 1 Avg + 1 SD Space_Flight_hr 0.706 0.0974
## 2 Avg + 2 SD_Space_flight_hr 1.66  0.253 
## 3 Avg + 3 SD_Space_Flight_hr 2.98  0.492
tidd%>%
  ggplot()+geom_histogram(aes(x=Space_Walk))+facet_wrap(~Flight_Hours,ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

tidd%>%
  group_by(Flight_Hours)%>%
  summarise(Space_Walk=mean(Space_Walk))%>%
  ggplot()+geom_point(aes(x=Flight_Hours,y=Space_Walk))

Gender differences in number of space walks completed

According to https://www.politico.com/newsletters/women-rule/2019/03/29/nasa-spacesuits-and-a-world-designed-for-men-418357, the first all-female space walk was scheduled for 3/29/19, but it could not be accomplished since there were not enough space suits on the International Space Station designed for women. I will be demonstrating the gender differences below.

The first difference between males and females shows that males complete an average of about 1.9 space walks and females complete an average of about 0.56 space walks. Male astronauts complete about 1.4 more space walks than females do, on average. The averages tell a similar story to the article above.

z.out4 <- zelig(Space_Walks ~ Space_Flight_hr+Gender+Status+Military_Rank, model = "poisson", data = astronauts1, cite = F)

xf<-setx(z.out4,Gender="Female")
sf<-sim(z.out4,x=xf)

xm<-setx(z.out4,Gender="Male")
sm<-sim(z.out4,x=xf,x1=xm)
fdm<- sm$get_qi(xvalue="x1", qi="fd")
summary(sm)
## 
##  sim x :
##  -----
## ev
##           mean        sd       50%      2.5%     97.5%
## [1,] 0.5590109 0.1154343 0.5492171 0.3705119 0.8178996
## pv
##       mean        sd 50% 2.5% 97.5%
## [1,] 0.532 0.7453399   0    0     2
## 
##  sim x1 :
##  -----
## ev
##          mean       sd      50%     2.5%    97.5%
## [1,] 1.915505 0.168059 1.914736 1.606066 2.243785
## pv
##       mean       sd 50% 2.5% 97.5%
## [1,] 2.062 1.428354   2    0     5
## fd
##          mean        sd      50%     2.5%    97.5%
## [1,] 1.356494 0.1768487 1.357769 1.016348 1.694104
d2 <- as.data.frame(cbind(fdm))
head(d2)
##          V1
## 1 1.1156505
## 2 1.6059349
## 3 0.7821102
## 4 1.4742447
## 5 1.2466700
## 6 1.4833298
tidd <- d2 %>% 
  gather(Gender, Space_Walk, 1)

s_ms<-tidd %>% 
  group_by(Gender)%>% 
  summarise(mean = mean(Space_Walk), sd = sd(Space_Walk))

s_ms
## # A tibble: 1 x 3
##   Gender  mean    sd
##   <chr>  <dbl> <dbl>
## 1 V1      1.36 0.177
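
As a rough cross-check, the GenderMale coefficient from model 4 can be turned into a rate ratio and compared with the ratio of the simulated expected values above. The variable names are mine and the inputs are copied from the printed output, so this is only a consistency check on rounded numbers.

```r
# GenderMale coefficient from model 4 in the regression table (log scale)
b_male <- 1.24

# Exponentiating a Poisson coefficient gives a multiplicative rate ratio
irr <- exp(b_male)

# Simulated expected values from summary(sm): x = Female, x1 = Male
ev_female <- 0.5590109
ev_male <- 1.915505
ratio_sim <- ev_male / ev_female

round(c(rate_ratio_from_coefficient = irr, ratio_of_simulated_means = ratio_sim), 2)
```

Both routes suggest that, holding the other predictors fixed, male astronauts are expected to have completed roughly three and a half times as many space walks as female astronauts.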

Plotting the first difference by gender as a histogram

tidd%>%
  ggplot()+geom_histogram(aes(x=Space_Walk))+facet_wrap(~Gender)+geom_vline(data=s_ms,aes(xintercept=mean),color="red") + xlab("First Difference Average Number of Space Walks By Gender")+theme_bw()

Plotting gender differences by expected value

I am also going to graph the expected values of each gender’s average number of space walks taken, based on the explanations above.

z.out4 <- zelig(Space_Walks ~ Space_Flight_hr+Gender+Status+Military_Rank, model = "poisson", data = astronauts1, cite = F)

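# Define both counterfactuals first so nothing depends on objects left over from earlier chunks
xf<-setx(z.out4,Gender="Female")
xm<-setx(z.out4,Gender="Male")

# One simulation with x = Male and x1 = Female gives expected values for both genders
sf<-sim(z.out4,x=xm,x1=xf)
evf<-sf$get_qi(xvalue="x1",qi="ev")
evm<-sf$get_qi(xvalue="x",qi="ev")

# Separate simulation for males only, summarised below
sm<-sim(z.out4,x=xm)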
summary(sm)
## 
##  sim x :
##  -----
## ev
##          mean        sd      50%     2.5%    97.5%
## [1,] 1.908238 0.1625393 1.901703 1.610768 2.256415
## pv
##       mean       sd 50% 2.5% 97.5%
## [1,] 1.926 1.380735   2    0     5
d3 <- as.data.frame(cbind(evf,evm))%>%
  rename("Female"=V1,
        "Male"=V2)
head(d3)
##      Female     Male
## 1 0.6276570 2.060944
## 2 0.6483469 2.022910
## 3 0.6000663 1.810451
## 4 0.7087456 2.004331
## 5 0.6137137 1.901496
## 6 0.6029566 2.081129
tidd <- d3 %>% 
  gather(Gender, Space_Walk, 1:2)
s_ms2<-tidd%>%
  group_by(Gender)%>% 
  summarise(mean = mean(Space_Walk), sd = sd(Space_Walk))

Plotting the expected value by gender as a histogram

tidd%>%
  ggplot()+geom_histogram(aes(x=Space_Walk))+facet_wrap(~Gender)+geom_vline(data=s_ms2,aes(xintercept=mean),color="Pink") + xlab("Expected Value Average Number of Space Walks By Gender")+theme_bw()

Conclusions and Limitations:

As you can see, an astronaut’s number of space flight hours is positively associated with how many space walks they’ve done. Gender is also a strong predictor of how many space walks an astronaut has completed.

The dataset I chose has a limited number of observations because it covers astronauts, who make up a very small percentage of the population, and completing a space walk is reserved for the most qualified among them. Variables like Space Flights and Space Flight (hr) are limiting in that they are closely related to each other, so I cannot use both in the same model. Other variables, such as Year, do not make sense in this analysis, since each row is one astronaut and an astronaut can complete space walks in different years. Knowing when each space walk occurred would be important, but this dataset doesn’t break space walks down by year.

NASA’s funding has been cut in the last few years, which may have decreased the number of space walks. Perhaps another variable, such as average NASA funding during the person’s years as an astronaut, would be an important factor in how many space walks they completed. Another important limitation is that the dataset does not include the reason for each space walk. I have no context for these space walks, and that context could also help predict the number of space walks for an astronaut.