I found this week’s dataset about astronauts and space walks on Kaggle. I think it would be interesting to see how the number of space walks an astronaut has completed is related to their gender, number of space flight hours logged, status, and military rank. I will specifically analyze the number of space flight hours logged and gender.
library(ggplot2)
library(texreg)
## Version: 1.36.23
## Date: 2017-03-03
## Author: Philip Leifeld (University of Glasgow)
##
## Please cite the JSS article in your publications -- see citation("texreg").
library(Zelig)
## Loading required package: survival
##
## Attaching package: 'Zelig'
## The following object is masked from 'package:ggplot2':
##
## stat
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:texreg':
##
## extract
library(readr)
astronauts_1_ <- read_csv("C:/Users/abbys/Downloads/astronauts (1).csv")
## Parsed with column specification:
## cols(
## Name = col_character(),
## Year = col_double(),
## Group = col_double(),
## Status = col_character(),
## `Birth Date` = col_character(),
## `Birth Place` = col_character(),
## Gender = col_character(),
## `Alma Mater` = col_character(),
## `Undergraduate Major` = col_character(),
## `Graduate Major` = col_character(),
## `Military Rank` = col_character(),
## `Military Branch` = col_character(),
## `Space Flights` = col_double(),
## `Space Flight (hr)` = col_double(),
## `Space Walks` = col_double(),
## `Space Walks (hr)` = col_double(),
## Missions = col_character(),
## `Death Date` = col_character(),
## `Death Mission` = col_character()
## )
head(astronauts_1_)
## # A tibble: 6 x 19
## Name Year Group Status `Birth Date` `Birth Place` Gender `Alma Mater`
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 Jose~ 2004 19 Active 5/17/1967 Inglewood, CA Male University ~
## 2 Lore~ NA NA Retir~ 3/7/1936 Lewiston, MT Male Montana Sta~
## 3 Jame~ 1984 10 Retir~ 3/3/1946 Warsaw, NY Male US Military~
## 4 Thom~ 1987 12 Retir~ 5/20/1951 St. Louis, MO Male University ~
## 5 Buzz~ 1963 3 Retir~ 1/20/1930 Montclair, NJ Male US Military~
## 6 Andr~ 1987 12 Retir~ 8/4/1955 Philadelphia~ Male Villanova U~
## # ... with 11 more variables: `Undergraduate Major` <chr>, `Graduate
## # Major` <chr>, `Military Rank` <chr>, `Military Branch` <chr>, `Space
## # Flights` <dbl>, `Space Flight (hr)` <dbl>, `Space Walks` <dbl>, `Space
## # Walks (hr)` <dbl>, Missions <chr>, `Death Date` <chr>, `Death
## # Mission` <chr>
astronauts1<-astronauts_1_ %>%
filter(!is.na(`Space Flight (hr)`),
!is.na(`Space Flights`),
!is.na(`Space Walks`),
!is.na(Year),
!is.na(Gender),
!is.na(Status),
!is.na(`Military Rank`)
)%>%
rename(Space_Walks=`Space Walks`,
Space_Flights=`Space Flights`,
Space_Flight_hr=`Space Flight (hr)`,
Space_Walks_hr=`Space Walks (hr)`,
Military_Rank=`Military Rank`)%>%
mutate(Gender=factor(Gender),Status = factor(Status),Military_Rank=factor(Military_Rank))
dim(astronauts1)
## [1] 330 19
head(astronauts1)
## # A tibble: 6 x 19
## Name Year Group Status `Birth Date` `Birth Place` Gender `Alma Mater`
## <chr> <dbl> <dbl> <fct> <chr> <chr> <fct> <chr>
## 1 Jose~ 2004 19 Active 5/17/1967 Inglewood, CA Male University ~
## 2 Jame~ 1984 10 Retir~ 3/3/1946 Warsaw, NY Male US Military~
## 3 Thom~ 1987 12 Retir~ 5/20/1951 St. Louis, MO Male University ~
## 4 Buzz~ 1963 3 Retir~ 1/20/1930 Montclair, NJ Male US Military~
## 5 Andr~ 1987 12 Retir~ 8/4/1955 Philadelphia~ Male Villanova U~
## 6 Jose~ 1967 6 Retir~ 6/27/1937 Crawsfordsvi~ Male DePauw Univ~
## # ... with 11 more variables: `Undergraduate Major` <chr>, `Graduate
## # Major` <chr>, Military_Rank <fct>, `Military Branch` <chr>,
## # Space_Flights <dbl>, Space_Flight_hr <dbl>, Space_Walks <dbl>,
## # Space_Walks_hr <dbl>, Missions <chr>, `Death Date` <chr>, `Death
## # Mission` <chr>
I will be using a Poisson model since my dependent variable, Space_Walks, is count data with values of 0 or greater.
Of the four models fitted below, model 4 has the lowest AIC and BIC, so I will be using model 4 for my analysis.
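As a rough illustration of this model-selection step, the sketch below fits two nested Poisson regressions with base R's `glm()` and compares their AIC and BIC. It also computes a Pearson dispersion statistic as a quick check on the Poisson assumption. The data frame `toy` and its values are simulated for illustration only; they are an assumption and are not the astronaut data.

```r
# Simulated stand-in data: column names mirror the analysis, values are made up.
set.seed(1)
n <- 300
toy <- data.frame(
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  Space_Flight_hr = rexp(n, rate = 1 / 1500)
)
# Counts generated from a Poisson process that depends on both predictors.
toy$Space_Walks <- rpois(
  n,
  lambda = exp(-0.5 + 0.8 * (toy$Gender == "Male") + 0.0003 * toy$Space_Flight_hr)
)

# Two nested Poisson models, analogous to adding a predictor between models.
m1 <- glm(Space_Walks ~ Gender, family = poisson, data = toy)
m2 <- glm(Space_Walks ~ Gender + Space_Flight_hr, family = poisson, data = toy)

# Lower AIC/BIC indicates a better trade-off between fit and complexity.
fit_stats <- data.frame(
  model = c("m1", "m2"),
  AIC = c(AIC(m1), AIC(m2)),
  BIC = c(BIC(m1), BIC(m2))
)
print(fit_stats)

# Pearson dispersion: values well above 1 would suggest overdispersion,
# in which case a quasi-Poisson or negative binomial model may fit better.
dispersion <- sum(residuals(m2, type = "pearson")^2) / df.residual(m2)
print(dispersion)
```

The same `AIC()`, `BIC()`, and dispersion calculations could be run on a `glm()` fitted to `astronauts1` with the model 4 formula.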
z.out1 <- zelig(Space_Walks ~ Gender, model = "poisson", data = astronauts1, cite = F)
z.out2 <- zelig(Space_Walks ~ Gender+Status, model = "poisson", data = astronauts1, cite = F)
z.out3 <- zelig(Space_Walks ~ Gender+Status+Military_Rank, model = "poisson", data = astronauts1, cite = F)
z.out4 <- zelig(Space_Walks ~ Space_Flight_hr+Gender+Status+Military_Rank, model = "poisson", data = astronauts1, cite = F)
htmlreg(list(z.out1,z.out2,z.out3,z.out4),doctype=FALSE)
| | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|
| (Intercept) | -0.47* | 0.14 | -0.15 | -0.75 |
| | (0.18) | (0.19) | (0.49) | (0.50) |
| GenderMale | 0.86*** | 1.05*** | 1.30*** | 1.24*** |
| | (0.19) | (0.19) | (0.19) | (0.19) |
| StatusDeceased | | -2.33*** | -2.26*** | -1.51*** |
| | | (0.28) | (0.28) | (0.31) |
| StatusManagement | | -0.51*** | -0.54*** | 0.04 |
| | | (0.15) | (0.15) | (0.18) |
| StatusRetired | | -0.94*** | -0.93*** | -0.34* |
| | | (0.11) | (0.11) | (0.14) |
| Military_RankCaptain | | | 0.04 | -0.19 |
| | | | (0.46) | (0.46) |
| Military_RankColonel | | | -0.28 | -0.39 |
| | | | (0.46) | (0.46) |
| Military_RankCommander | | | -0.46 | -0.31 |
| | | | (0.52) | (0.52) |
| Military_RankLieutenant Colonel | | | -0.57 | -0.61 |
| | | | (0.63) | (0.63) |
| Military_RankLieutenant Commander | | | -15.90 | -15.95 |
| | | | (1089.25) | (1096.81) |
| Military_RankLieutenant General | | | -0.46 | -0.71 |
| | | | (1.10) | (1.10) |
| Military_RankMajor | | | -15.20 | -15.30 |
| | | | (1051.68) | (1051.65) |
| Military_RankMajor General | | | -16.63 | -16.62 |
| | | | (1047.88) | (1045.58) |
| Military_RankNot In Military | | | 0.39 | 0.27 |
| | | | (0.46) | (0.46) |
| Military_RankRear Admiral | | | 0.64 | 0.60 |
| | | | (0.73) | (0.73) |
| Military_RankVice Admiral | | | -16.53 | -16.49 |
| | | | (2103.36) | (2103.36) |
| Space_Flight_hr | | | | 0.00*** |
| | | | | (0.00) |
| AIC | 1334.11 | 1205.18 | 1161.68 | 1101.56 |
| BIC | 1341.71 | 1224.17 | 1222.46 | 1166.15 |
| Log Likelihood | -665.06 | -597.59 | -564.84 | -533.78 |
| Deviance | 938.62 | 803.68 | 738.18 | 676.07 |
| Num. obs. | 330 | 330 | 330 | 330 |
| ***p < 0.001, **p < 0.01, *p < 0.05 | | | | |
For my analysis, I am going to focus on two of the statistically significant predictors in model 4: Space_Flight_hr and Gender.
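Poisson coefficients are on the log scale, so they are easier to read once exponentiated into rate ratios. The short sketch below does that arithmetic for a few model 4 coefficients copied by hand from the table; the vector name `coefs` is just an illustrative choice.

```r
# Coefficients copied from the Model 4 column of the table (log scale).
coefs <- c(GenderMale = 1.24, StatusDeceased = -1.51, StatusRetired = -0.34)

# Exponentiating a Poisson coefficient gives a multiplicative rate ratio:
# the factor by which the expected count changes relative to the baseline.
rate_ratios <- exp(coefs)
print(round(rate_ratios, 2))
```

For example, exp(1.24) is about 3.46, so model 4 estimates that male astronauts have roughly 3.5 times the expected number of space walks of female astronauts, holding the other variables fixed. The Space_Flight_hr coefficient prints as 0.00 only because of rounding: the change per single hour is tiny, but it adds up over the hundreds or thousands of hours that separate astronauts.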
I want to see how the number of space flight hours logged relates to how many space walks an astronaut has completed. I set three counterfactuals, each comparing Space_Flight_hr at its mean against Space_Flight_hr at the mean plus 1, 2, and 3 standard deviations.
Are astronauts who are 1 standard deviation above the mean for space flight hours expected to have completed more space walks than those at the mean? At the mean, the expected number of space walks is about 1.9; at the mean + 1 standard deviation, it is about 2.6 (more space flight hours predicts more space walks). An astronaut 1 standard deviation above the average completes about 0.7 more space walks, on average.
z.out4 <- zelig(Space_Walks ~ Space_Flight_hr+Gender+Status+Military_Rank, model = "poisson", data = astronauts1, cite = F)
x<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr))
x1<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr)+(sd(astronauts1$Space_Flight_hr)))
s1<-sim(z.out4,x=x,x1=x1)
summary(s1)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 1.91118 0.1566235 1.905723 1.633986 2.229349
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 1.929 1.416321 2 0 5
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 2.612319 0.224598 2.601627 2.201223 3.060986
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 2.573 1.616566 2 0 6
## fd
## mean sd 50% 2.5% 97.5%
## [1,] 0.7011387 0.09559223 0.6951531 0.5301676 0.9100785
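To make the `ev` and `fd` rows above less of a black box, here is a minimal base R sketch of the general simulation idea: draw coefficient vectors from their estimated sampling distribution, compute the expected count under each counterfactual, and take the difference. This is my reading of the approach, not Zelig's actual code, and the data frame `toy` with columns `hours` and `walks` is simulated and purely hypothetical.

```r
# Simulated stand-in data (hypothetical values, not the astronaut data).
set.seed(42)
n <- 300
toy <- data.frame(hours = rexp(n, rate = 1 / 1500))
toy$walks <- rpois(n, lambda = exp(0.2 + 0.0003 * toy$hours))

fit <- glm(walks ~ hours, family = poisson, data = toy)

# Draw coefficient vectors from Normal(coef(fit), vcov(fit)),
# using a Cholesky factor of the covariance matrix.
n_sims <- 1000
chol_factor <- t(chol(vcov(fit)))
z <- matrix(rnorm(n_sims * length(coef(fit))), nrow = length(coef(fit)))
draws <- t(coef(fit) + chol_factor %*% z)  # n_sims x 2 matrix of coefficients

# Two counterfactual profiles: hours at the mean, and one SD above the mean.
x_mean <- c(1, mean(toy$hours))
x_plus <- c(1, mean(toy$hours) + sd(toy$hours))

# Expected counts under each profile for every simulated coefficient vector.
ev_mean <- exp(draws %*% x_mean)
ev_plus <- exp(draws %*% x_plus)

# First difference: change in the expected count between the two profiles.
fd <- as.vector(ev_plus - ev_mean)
print(c(mean = mean(fd), quantile(fd, c(0.025, 0.975))))
```

The mean and the 2.5% and 97.5% quantiles of `fd` play the same role as the `fd` row in the `summary()` output. Because the draws are random, the numbers shift slightly on every run, which is also why the `sim x` rows differ a little between the three summaries in this post.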
At the mean, the expected number of space walks is about 1.9; at the mean + 2 standard deviations of space flight hours, it is about 3.58. An astronaut 2 standard deviations above the mean completes about 1.67 more space walks, on average.
x<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr))
x1<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr)+(2*sd(astronauts1$Space_Flight_hr)))
s2<-sim(z.out4,x=x,x1=x1)
summary(s2)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 1.912986 0.1592523 1.90885 1.624212 2.238659
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 1.897 1.397263 2 0 5
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 3.583403 0.362065 3.567036 2.936899 4.342661
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 3.631 1.967921 3 0 8
## fd
## mean sd 50% 2.5% 97.5%
## [1,] 1.670417 0.2592069 1.657284 1.221215 2.218257
At the mean, the expected number of space walks is about 1.9; at the mean + 3 standard deviations of space flight hours, it is about 4.9. An astronaut 3 standard deviations above the average completes about 3 more space walks, on average.
x<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr))
x1<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr)+(3*sd(astronauts1$Space_Flight_hr)))
s3<-sim(z.out4,x=x,x1=x1)
summary(s3)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 1.907846 0.1644072 1.900581 1.605457 2.252538
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 1.964 1.345554 2 0 5
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 4.894271 0.5847786 4.883474 3.859309 6.109611
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 4.917 2.281955 5 1 10
## fd
## mean sd 50% 2.5% 97.5%
## [1,] 2.986425 0.493559 2.964175 2.090323 4.025394
xm<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr))
xsd1<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr)+sd(astronauts1$Space_Flight_hr))
sd1<-sim(z.out4,x=xm,x1=xsd1)
fsd1 <- sd1$get_qi(xvalue="x1", qi="fd")
xm<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr))
xsd2<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr)+2*sd(astronauts1$Space_Flight_hr))
sd2<-sim(z.out4,x=xm,x1=xsd2)
fsd2 <- sd2$get_qi(xvalue="x1", qi="fd")
xm<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr))
xsd3<-setx(z.out4,Space_Flight_hr=mean(astronauts1$Space_Flight_hr)+(3*sd(astronauts1$Space_Flight_hr)))
sd3<-sim(z.out4,x=xm,x1=xsd3)
fsd3 <- sd3$get_qi(xvalue="x1", qi="fd")
The scatter plot shows a positive relationship between space flight hours and space walks: the average first difference in expected space walks grows as the counterfactual moves from 1 to 2 to 3 standard deviations above the mean. The histograms show the simulated first differences for each counterfactual; their centers shift to the right and their spread widens as flight hours increase. In short, each additional standard deviation of space flight hours is associated with more space walks, on average.
d <- as.data.frame(cbind(fsd1, fsd2, fsd3))%>%
rename("Avg + 1 SD Space_Flight_hr"=V1,
"Avg + 2 SD_Space_flight_hr"=V2,
"Avg + 3 SD_Space_Flight_hr"=V3)
head(d)
## Avg + 1 SD Space_Flight_hr Avg + 2 SD_Space_flight_hr
## 1 0.6729308 1.690023
## 2 1.0207631 1.337347
## 3 0.6766981 1.744243
## 4 0.6837467 2.014837
## 5 0.5854133 1.986321
## 6 0.6136207 1.930837
## Avg + 3 SD_Space_Flight_hr
## 1 2.356014
## 2 3.581960
## 3 2.535662
## 4 3.701167
## 5 3.802460
## 6 3.237260
tidd <- d %>%
gather(Flight_Hours, Space_Walk, 1:3)
head(tidd)
## Flight_Hours Space_Walk
## 1 Avg + 1 SD Space_Flight_hr 0.6729308
## 2 Avg + 1 SD Space_Flight_hr 1.0207631
## 3 Avg + 1 SD Space_Flight_hr 0.6766981
## 4 Avg + 1 SD Space_Flight_hr 0.6837467
## 5 Avg + 1 SD Space_Flight_hr 0.5854133
## 6 Avg + 1 SD Space_Flight_hr 0.6136207
tidd %>%
group_by(Flight_Hours) %>%
summarise(mean = mean(Space_Walk), sd = sd(Space_Walk))
## # A tibble: 3 x 3
## Flight_Hours mean sd
## <chr> <dbl> <dbl>
## 1 Avg + 1 SD Space_Flight_hr 0.706 0.0974
## 2 Avg + 2 SD_Space_flight_hr 1.66 0.253
## 3 Avg + 3 SD_Space_Flight_hr 2.98 0.492
tidd%>%
ggplot()+geom_histogram(aes(x=Space_Walk))+facet_wrap(~Flight_Hours,ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
tidd%>%
group_by(Flight_Hours)%>%
summarise(Space_Walk=mean(Space_Walk))%>%
ggplot()+geom_point(aes(x=Flight_Hours,y=Space_Walk))
According to https://www.politico.com/newsletters/women-rule/2019/03/29/nasa-spacesuits-and-a-world-designed-for-men-418357, the first all-female space walk was scheduled for 3/29/19, but it could not be carried out because there were not enough space suits on the International Space Station designed for women. I will demonstrate the gender differences below.
The first difference between males and females shows that males complete an expected 1.9 space walks and females an expected 0.56 space walks. Male astronauts complete about 1.36 more space walks than females do, on average. These averages are consistent with the story in the article above.
z.out4 <- zelig(Space_Walks ~ Space_Flight_hr+Gender+Status+Military_Rank, model = "poisson", data = astronauts1, cite = F)
xf<-setx(z.out4,Gender="Female")
sf<-sim(z.out4,x=xf)
xm<-setx(z.out4,Gender="Male")
sm<-sim(z.out4,x=xf,x1=xm)
fdm<- sm$get_qi(xvalue="x1", qi="fd")
summary(sm)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 0.5590109 0.1154343 0.5492171 0.3705119 0.8178996
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 0.532 0.7453399 0 0 2
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 1.915505 0.168059 1.914736 1.606066 2.243785
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 2.062 1.428354 2 0 5
## fd
## mean sd 50% 2.5% 97.5%
## [1,] 1.356494 0.1768487 1.357769 1.016348 1.694104
d2 <- as.data.frame(cbind(fdm))
head(d2)
## V1
## 1 1.1156505
## 2 1.6059349
## 3 0.7821102
## 4 1.4742447
## 5 1.2466700
## 6 1.4833298
tidd <- d2 %>%
gather(Gender, Space_Walk, 1)
s_ms<-tidd %>%
group_by(Gender)%>%
summarise(mean = mean(Space_Walk), sd = sd(Space_Walk))
s_ms
## # A tibble: 1 x 3
## Gender mean sd
## <chr> <dbl> <dbl>
## 1 V1 1.36 0.177
tidd%>%
ggplot()+geom_histogram(aes(x=Space_Walk))+facet_wrap(~Gender)+geom_vline(data=s_ms,aes(xintercept=mean),color="red") + xlab("First Difference Average Number of Space Walks By Gender")+theme_bw()
### Plotting gender differences by expected value
I am also going to graph the expected number of space walks for each gender, based on the explanations above.
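z.out4 <- zelig(Space_Walks ~ Space_Flight_hr+Gender+Status+Military_Rank, model = "poisson", data = astronauts1, cite = F)
# Define both counterfactuals before simulating, so xm does not depend on an earlier chunk
xf<-setx(z.out4,Gender="Female")
xm<-setx(z.out4,Gender="Male")
# One simulation with x = Male and x1 = Female gives expected values for both genders
sf<-sim(z.out4,x=xm,x1=xf)
evf<-sf$get_qi(xvalue="x1",qi="ev")
evm <- sf$get_qi(xvalue="x", qi="ev")
# Separate male-only simulation for the summary below
sm<-sim(z.out4,x=xm)
summary(sm)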
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 1.908238 0.1625393 1.901703 1.610768 2.256415
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 1.926 1.380735 2 0 5
d3 <- as.data.frame(cbind(evf,evm))%>%
rename("Female"=V1,
"Male"=V2)
head(d3)
## Female Male
## 1 0.6276570 2.060944
## 2 0.6483469 2.022910
## 3 0.6000663 1.810451
## 4 0.7087456 2.004331
## 5 0.6137137 1.901496
## 6 0.6029566 2.081129
tidd <- d3 %>%
gather(Gender, Space_Walk, 1:2)
s_ms2<-tidd%>%
group_by(Gender)%>%
summarise(mean = mean(Space_Walk), sd = sd(Space_Walk))
tidd%>%
ggplot()+geom_histogram(aes(x=Space_Walk))+facet_wrap(~Gender)+geom_vline(data=s_ms2,aes(xintercept=mean),color="Pink") + xlab("Expected Value Average Number of Space Walks By Gender")+theme_bw()
As you can see, an astronaut’s number of space flight hours is positively associated with how many space walks they’ve done. Gender is also strongly associated with the number of space walks an astronaut has completed.
The dataset I chose has a limited number of observations because it covers astronauts, who make up a very small percentage of the population. Completing a space walk is reserved for the most qualified people. Variables like Space Flights and Space Flight (hr) are limiting in that they are closely related to each other (more flights means more hours), so I did not use both in the same model. Other variables, such as Year, do not make sense in this dataset since astronauts can complete space walks in different years and each row is 1 astronaut. The years would be important since they would show when each space walk occurred, but this dataset doesn’t show space walks by year. NASA’s funding has been cut in the last few years, decreasing the number of space walks because of a lack of funds. Perhaps another variable, such as average NASA funding during the person’s astronaut years, would be an important factor in how many space walks an astronaut completed. Another important limitation is that the dataset does not include the reason for each space walk. I have no information about the context of these space walks, which could also help predict the number of space walks for an astronaut.
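One way to check how closely related two predictors are before deciding to keep only one is to look at their correlation. The sketch below uses simulated values (the vectors `flights` and `hours` are hypothetical, not the real columns) in which hours are roughly proportional to flights.

```r
# Simulated stand-in values: hours are roughly proportional to flights.
set.seed(7)
flights <- rpois(200, lambda = 3)
hours <- flights * 250 + rnorm(200, sd = 100)

# A correlation close to 1 means the two predictors carry almost the same
# information, so their separate effects are hard to estimate in one model.
r <- cor(flights, hours)
print(r)
```

On the real data, the equivalent check would be `cor(astronauts1$Space_Flights, astronauts1$Space_Flight_hr)`.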