Visualize the List of NASA Astronauts

This is my work for DataViz Battle in Reddit’s (r/dataisbeautiful)[www.reddit.com/r/dataisbeautiful].
Disclaimer
* All the data was provide by r/dataisbeautiful;
* was done in R, with a lot of ggplot2 and dplyr;
* This is an original content
* Database;
* Database prerared in Excel.

I advise you to read this analysis by listening to David Bowie’s Space Oddity in honer to Chris Hadfield, who sang this song at the ISS:

thanks

0. Goals

As proposed by the challenge, I will develop a data visualization with the NASA astronaut database in commemoration for its 60th anniversary.

With reliable variables given, this exploratory data analysis (EDA) will investigate mainly two factors related to the characteristics of the astronauts:
* What guides the selection of astronauts regarding the year of selection, genre and civil status; * How the characteristics and selection group play a role in relation to time in space (perhaps the most palpable and objective metric for an astronaut’s success).

Variables

After analyzing the available base, the observed potential is in relation to astronauts’ characteristics, selection data and time of flight in the space. Thus, the main variables used will be:
* Selection Year
* Civil Status
* Gender
* Date Birth
* Space Flight Time
* Military Position
* Military Force

Data Reliability

I have tried to check if the database contains information that is true to reality. Unfortunately, there are some fields with many errors. This is a major limitation for developing visualization. The various most poorly reported and therefore I believe are not possible to be explored are:
* Missions Flown: too much inconsistency. Some astronauts have expediditions on the ISS, but without the flights to the ISS, I do not believe they got there otherwise.
* Job: many NAs. And astronauts can perform different jobs on each different mission.
* Status: especially in relation to “Deceased”, some astronauts died on mission and others after retirement. Not all deceased astronauts are reported. For example, Neil Armstrong has a status of “former”, but he died in 2012.

Given this situation, I would completely refuse to use these variables in this data visualization!

1. Data Input

Libraries

library(waffle)
library(dplyr)
library(ggplot2)
library(treemapify)
library(ggplotify)
library(ggExtra)
library(scales)
library(gganimate)
library(ggthemes)
library(ggExtra)

Let’s import the data and give a first look:

nasa <- read.csv("NASA astronauts v2.csv", sep = ";", header = TRUE)

If you want to check the data

head(nasa)
str(nasa)
summary(nasa)

Create some variables:

nasa$military_force <- ifelse(is.na(nasa$military_force), "civilian", as.character(nasa$military_force))
nasa$age_selection <- nasa$selection_year - nasa$year_birth

2. Summary of the groups selection by NASA

Nasa has selected 20 groups of astronauts since them:

selection_groups <- nasa %>% group_by(group, selection_year) %>% summarise(astrounauts = length(astronaut))
selection_groups$time_next <- c(diff(selection_groups$selection_year), 
                                2018 - selection_groups$selection_year[nrow(selection_groups)])
selection_groups$time_next <- as.numeric(selection_groups$time_next)

We can visualize how many have been selected by year/group

ggplot(selection_groups, aes(x=selection_year, y=astrounauts, fill=-astrounauts)) + 
  labs(title = "Number of astronauts selected by year/group", subtitle = "Columns Chart",
       x = "Selection Year", y = "# Astronauts Selected") +
  geom_col() +
  theme(legend.position="none")

How long does it take until the next selection?
One can notice 2 clear periodos for astronauts selection, pre and post 1969.
A brief look shows two great hiatus: 1969-1978 and 2009-now(2018):

selection_groups$period <- ifelse(selection_groups$selection_year <= 1969, "pre 1969", "post 1969")
selection_groups$period <- factor(selection_groups$period, levels = c("pre 1969", "post 1969"))

ggplot(selection_groups, aes(x=selection_year, y=time_next, color = period)) + 
  labs(title = "Time to the next selection", subtitle = "Lollipop Chart", x = "Selection Year", y = "Years") +
  geom_point(size=3) + 
  geom_segment(aes(x=selection_year, 
                   xend=selection_year, 
                   y=0, 
                   yend=time_next)) +
  scale_y_continuous(breaks = seq(0, 10, 2), lim = c(0, 10)) +
  theme(legend.position="top")

Let’s visualize these to vairables combined.
Is there a connection between the number of astronauts selected and the time for the next selection?
Intuitively, the more astronauts selected, the longer the time to the next group. But perhaps it is possible to affirm that there are more astronauts selected in periods of expansion of the space program, and vice versa.

ggplot(selection_groups, aes(x=astrounauts, y=jitter(time_next), color = period)) + 
  labs(title="# Astronauts and Time to Next selection",  subtitle = "Scatterplot",
       x = "# of Astronauts Selected", y = "Time to Next Selection", color = "Period") +
  geom_point(size = 3) +
  scale_y_continuous(breaks = seq(0, 10, 2), lim = c(0, 10)) +
  theme(legend.position="top")

3. Astronauts Selection

An example of the astronaut selection process:

Genre

How is the gender relationship in the space program since them?
Only 14% of the astronauts selected were women so far. Unfortunately there is a “evidence that women remain underrepresented in the Stem fields of science, technology, engineering and math” (from this article).
I’m kind of bored of pie charts, so let’s see in a new way:
source

waffle_df <- nasa %>% group_by(gender) %>% summarise(astrounauts = length(astronaut))
waffle_df <- round(c(`Female (48 - 14%)`=waffle_df[[1,2]]/nrow(nasa)*100, 
                     `Male (287 - 86%)`=waffle_df[[2,2]]/nrow(nasa)*100))
# sorry, I could not do it more automatic...

waffle(waffle_df, rows = 10) + scale_fill_brewer(palette = "Set1")

Astrounats selected by year, age of selection and gender

An interesting approach is to investigate how old the astronaut was at the time of selection. One step further is to cross it with gender and civil status (civil/militar). Are these 3 variables connected?

g1 <- ggplot(nasa, aes(x=selection_year, y=age_selection, colour = gender)) + 
  geom_point(size = 2) + geom_jitter() +
  labs(title="Age of Selection by Selection Year",  subtitle = "Scatterplot",
       x = "Selection Year", y = "Age of Selection", color = "Gender") +
  theme(legend.position="top") +
  scale_colour_brewer(palette = "Set1")
g1

Same as above distinguished by civil status

g1 + facet_grid(. ~ civil_status)

Did you notice an increase in the age of selection over time? Let’s make a simple regression to try to visualize it!
It seems that for every civil status and genre there has been an increase in the age of selection of astronauts over time:

g1 + facet_grid(gender ~ civil_status) + geom_smooth(method = "lm")

Civil Status

Historically 61.8% of the selected astronauts were military, so let’s visualize these!

table_gender_status <- table(nasa$civil_status, nasa$gender)
addmargins(round(prop.table(table_gender_status)*100, 1))
##           
##            Female  Male   Sum
##   Civilian   10.4  27.8  38.2
##   Military    3.9  57.9  61.8
##   Sum        14.3  85.7 100.0

Visually there is a higher proportion of military men than among women, noting the table above. And how was this proportion among the selected groups?
On the one hand, historically the rate of military men was almost always greater than 50% (only 3 groups had a lower rate). However, among women there was a military majority only in the 1990s, while in the other periods there was not a single woman selected.

ggplot(nasa, aes(x=as.factor(selection_year), fill = civil_status)) + 
  geom_bar(aes(y=..count..), position = "fill") +
  scale_y_continuous(labels = percent_format()) +
  facet_grid(. ~ gender) +
  theme(legend.position="top") +
  labs(title="Civil Status per Year of Selection",  subtitle = "Bar Chart",
       x = "Selection Year", y = "Proportion os Astronauts", fill = "Civil Status") +
  theme(axis.text.x = element_text(angle=90)) +
  scale_fill_brewer(palette = "Set2")

By creating these charts I accidentally did the next chart.
It is interesting to note the behavior of the civil status in relation to the age of selection. It looks like a “normal” distribution for military astronauts due the age of selection, but with very strange tails!

ggplot(nasa, aes(x=age_selection, fill = civil_status)) + 
  geom_bar(aes(y=..count..), position = "fill") + 
  scale_y_continuous(labels = percent_format()) +
  theme(legend.position="top") +
    labs(title="Civil Status per Age of Selection",  subtitle = "Bar Chart",
       x = "Age of Selection", y = "Proportion os Astronauts", fill = "Civil Status") +
  scale_fill_brewer(palette = "Set2")

Military Force x Civilian

If most of the astronauts came from a military background, let’s see what this background was! What are the military forces in the astronaut team?

table_military_force <- nasa %>% group_by(military_force) %>% 
  summarise(astrounauts = length(astronaut))

table_military_force$military_force <- factor(table_military_force$military_force, levels = table_military_force$military_force[order(-table_military_force$astrounauts)])

ggplot(table_military_force, aes(x = military_force, y = astrounauts, fill = -astrounauts)) +
  geom_bar(stat="identity") + geom_text(aes(label = astrounauts, y = astrounauts + 5)) +
  labs(title = "Number of Astronauts for Military Force", subtitle = "Bar Chart",
       x = "Military Force", y = "# Astronauts Selected") +
  theme(legend.position="none") +
  theme(axis.text.x = element_text(angle=90))

Let’s visualize that (but I will not show 2 military forces with just 1 astronaut):

selection_year_age <- ggplot(nasa %>% filter(military_force != "Japan Air Self-Defense Force",
                                             military_force != "Royal Canadian Air Force",
                                             military_force != "US Coast Guard"),
                             aes(x=selection_year, y=age_selection, colour = military_force)) + 
  geom_point(size = 1) + 
  geom_jitter() +
  theme(legend.position="top") +
  labs(title="Age of Selection for Selection year",  subtitle = "Scatterplot",
       x = "Selection Year", y = "Age of Selection", fill = "Civil Status") +
  scale_colour_discrete(name  ="Military Force") + scale_colour_brewer(palette = "Dark2")
selection_year_age

Not so good, right? The same chart with facet:

selection_year_age + facet_grid(gender ~ military_force) +
  scale_x_continuous(breaks=seq(1960,2010,20),labels=seq(1960,2010,20))

Tree map of the military in NASA

source

tree_military_force <- nasa %>% filter(military_force != "civilian") %>%
  group_by(military_force, military_position) %>% summarise(astrounauts = length(astronaut))

ggplot(tree_military_force, aes(area = astrounauts, fill = military_force, label = military_position,
                subgroup = military_force)) +
  geom_treemap() +
  geom_treemap_subgroup_border() +
  geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.5, colour =
                               "black", fontface = "italic", min.size = 0) +
  geom_treemap_text(colour = "white", place = "topleft", reflow = T) +
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "bottom") +
  labs(
    title = "Proportion of Astronauts of each Military Force",
    caption = "The area of each tile represents the military force as a
    proportion of all astrounauts in NASA",
    fill = "Military Force"
  )

Gender Proportion

Are there more or less women being selected?
At most there were 25% of women in a selected group. And the trend does not indicate an equal distribution…

ggplot(nasa, aes(x=as.factor(selection_year))) + 
  geom_bar(aes(fill = gender), position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(title="Gender Proportion for Selection Year",  subtitle = "Bar Chart",
       x = "Selection Year", y = "Proportion", color = "Period") +
  theme(legend.position="top") +
  scale_fill_brewer(palette = "Set1") 

How is the gender and civil status selection?
source 1 and source 2

ggplot(nasa, aes(x = age_selection, fill = gender)) + 
  geom_histogram(data=subset(nasa, gender == "Female"),aes(y=..count..*(1))) + 
  geom_histogram(data=subset(nasa, gender == "Male"),aes(y=..count..*(-1))) + 
  scale_y_continuous(breaks=seq(-35,30,5),labels=abs(seq(-35,30,5))) +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  facet_grid(. ~ civil_status, margins=TRUE) +
    theme(legend.position="top") +
  labs(title="Population Pyramid by Age",  subtitle = "Population Pyramid Chart",
       x = "Age of Selection", y = "# of Astronauts", fill = "Gender")

We can visualize how the age pyramid developed by the selected groups. For that, I’m going to make an animation showing each of the 20 groups, but only for the total, okay?

ggplot(nasa, aes(x = age_selection, fill = gender)) + 
  geom_histogram(data=subset(nasa, gender == "Female"),aes(y=..count..*(1))) + 
  geom_histogram(data=subset(nasa, gender == "Male"),aes(y=..count..*(-1))) + 
  scale_y_continuous(breaks=seq(-35,30,5),labels=abs(seq(-35,30,5))) +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  theme(legend.position="top") +
  labs(title="Population Pyramid by Age, Group: {closest_state}",  subtitle = "Population Pyramid Chart",
       x = "Age of Selection", y = "# of Astronauts", fill = "Gender") +
  transition_states(group, transition_length = 2, state_length = 3) +
  enter_fade() + 
  exit_shrink() +
  ease_aes('sine-in-out')

Are the distributions between the age of selection different between the genres and civil status? (It does not seem so differe t…)

ggplot(nasa, aes(x = age_selection, fill = gender)) + 
  geom_histogram(data=subset(nasa, gender == "Female"),aes(y=..density..*(1))) + 
  geom_histogram(data=subset(nasa, gender == "Male"),aes(y=..density..*(-1))) + 
  scale_y_continuous(breaks=seq(-0.5,0.5,0.2),labels=abs(seq(-0.5,0.5,0.2)*100)) +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  facet_grid(. ~ civil_status) +
    theme(legend.position="top") +
  labs(title="Population Pyramid by Age",  subtitle = "Population Pyramid Chart",
       x = "Age of Selection", y = "% of Astronauts", fill = "Gender") +
  scale_y_continuous(labels = scales::percent)

4. Space Flights

Disclaimer: Maybe this variable is not so good reported. Some astronauts have done more flights than the dataset says. Because of this, I will not pay so much attention to this variable.

Number of Flights

ggplot(nasa, aes(x=num_flights, y=..count.., fill = gender)) + 
  geom_bar(position="dodge") +
  scale_x_continuous(breaks=seq(0,7,1),labels=seq(0,7,1)) +
  labs(title="Distribution of the # of Space Flights",  subtitle = "Bar Chart",
       x = "# of Space Flights", y = "# of Astronauts", fill = "Gender") +
  theme(legend.position="top") +
  scale_fill_brewer(palette = "Set1") 

By gender and civil status

Okay, there are fewer female astronauts, so the nominal distribution does not present the best comparison. And how is this proportional distribution by gender and civil status? It seems that no matter the group, there was a same distribution of the number of space flights:
(remembering that I do not think this variable is reliable and therefore worthy of evaluation)

ggplot(nasa, aes(x = num_flights, fill=civil_status)) + geom_bar(aes(y = ..prop..)) + 
  facet_grid(civil_status ~ gender, margins=TRUE) +
  scale_y_continuous(labels = scales::percent) +
  labs(title="Proportion of the # of Space Flights by Age and Civil Status",  subtitle = "Bar Chart",
       x = "# of Space Flights", y = "% of Astronauts", fill = "Gender") +
  theme(legend.position="none") +
  scale_fill_brewer(palette = "Set2") 

Space Flight Time

The database presents spatial flight time in hours, but this is difficult to visualize. Let’s turn the unit to days!

nasa$space_fligh_time <- nasa$space_fligh_time / 24

Histogram by gender

Clearly there are two groups of astronauts that stay in space: less and more than 3 months. And 3 months is the average time for fixed ISS residents.

ggplot(nasa, aes(x=space_fligh_time, y=..count.., fill = gender)) + 
  geom_histogram(position="dodge") +
     labs(title="Distribution of Space Flight Time",  subtitle = "Histogram Chart",
       x = "Space Flight Time [days]", y = "# of Astronauts", fill = "Gender") +
  scale_fill_brewer(palette = "Set1") 

Frequency by gender and civil status

The story for the space flight time is the same for the number of flights. There does not appear to be any discrimination between gender and civil status:

ggplot(nasa, aes(x = space_fligh_time, fill=civil_status)) + geom_histogram(aes(y = ..density..)) + 
  facet_grid(civil_status ~ gender, margins=TRUE) +
  scale_y_continuous(labels = scales::percent) + geom_rug() +
  labs(title="Proportion of  Space Flight Time by Age and Civil Status",  subtitle = "Histogram Chart",
       x = "Space Flight Time [days]", y = "% of Astronauts", fill = "Gender") +
  theme(legend.position="none") +
  scale_fill_brewer(palette = "Set2")

Number of flights X space time

ggplot(nasa, aes(x=jitter(num_flights, 1), y=space_fligh_time, colour = gender)) + 
  geom_point(size = 2) + geom_jitter() +
  scale_x_continuous(breaks=seq(0,7,1),labels=seq(0,7,1)) +
  labs(title="Space Flight Time by # of Space Flights",  subtitle = "Scatterplot",
       x = "# of Space Flights", y = "Space Flight Time [days]", colour = "Gender") +
  theme(legend.position="top") +
  scale_colour_brewer(palette = "Set1") +
  geom_rug(sides="l")

Average Space Time

This is one of the major findings of this exploratory data analysis: historically few astronauts have spent more than 2 ISS expeditions (3 months). Since those who went through double sessions (2 trips to return to Earth) did not make many more space travels.

nasa$average_space_time <- nasa$space_fligh_time / nasa$num_flights
ggplot(nasa %>% filter(num_flights > 0), aes(x=jitter(num_flights, 1), 
                                             y=average_space_time, colour = gender)) + 
  geom_point(size = 2) + geom_jitter() + 
    scale_x_continuous(breaks=seq(0,7,1),labels=seq(0,7,1)) +
  labs(title="Average Space Flight Time by # of Space Flights",  subtitle = "Scatterplot",
       x = "# of Space Flights", y = "Average Space Flight Time [days]", colour = "Gender") +
  theme(legend.position="top") +
  facet_grid(. ~ civil_status) +
  scale_colour_brewer(palette = "Set1") +
  geom_rug(sides="l")

Selection Year X Space Time

It is observable that after 1969 there was a technology leap that allowed astronauts to spend more than 3 months in space!

ggplot(nasa, aes(x=selection_year, y=space_fligh_time, colour = gender)) + 
  geom_point(size = 2) + geom_jitter() +
  labs(title="Space Flight Time by Selection Year",  subtitle = "Scatterplot",
       x = "Selection Year", y = "Space Flight Time [days]", colour = "Gender") +
  theme(legend.position="top") +
  scale_colour_brewer(palette = "Set1") +
  geom_rug(sides="l")

Age of Selection x Space Flight Time

if you are selected younger will there be more chances of having more space flight time? The answer seems to be a resounding “no”!

ggplot(nasa, aes(x=age_selection, y=space_fligh_time, colour = gender)) + 
  geom_point(size = 2) + geom_jitter() +
  labs(title="Space Flight Time by Age of Selection",  subtitle = "Scatterplot",
       x = "Age of Selection", y = "Space Flight Time [days]", colour = "Gender") +
  theme(legend.position="top") +
  scale_colour_brewer(palette = "Set1") +
  geom_rug(sides="l")

Space Flight Time x Selection Year X Age of Selection

(Rub your hands for the next graphs!)

g2 <- ggplot(nasa, aes(x=selection_year, y=age_selection, colour = space_fligh_time)) + 
  geom_point(aes(size = space_fligh_time), alpha=0.7) + geom_jitter() +
  scale_size_continuous(range = c(0, 10)) +
  scale_colour_gradientn(colours = terrain.colors(10)) +
  labs(title="Space Flight Time by Selection Year and Age of Selection",  subtitle = "Bubble Chart",
       x = "Selection Year", y = "Age of Selection", colour = "Gender") +
  labs(size = "Space Flight Time [days]", colour="Space Flight Time [days]")
g2

g2 + facet_grid(civil_status ~ gender)

5. Current Astronauts

perhaps there is a difficulty in replacing the old astronauts:

nasa$age_now <- 2018 - nasa$year_birth

ggplot(nasa %>% filter(status == "Current "), aes(x=selection_year, y=age_now, colour = gender)) + 
  geom_point(size = 2) + geom_jitter() +
  facet_grid(. ~ civil_status, margins=TRUE) +
      labs(title="Age of Current Astronauts for Selection year",  subtitle = "Scatterplot",
       x = "Selection Year", y = "Age in 2018", colour = "Gender") +
  theme(legend.position="top") +
  scale_colour_brewer(palette = "Set1") 

Do the youngest have experience already?

g3 <- ggplot(nasa %>% filter(status == "Current "), aes(x=age_now, y=space_fligh_time, colour = gender)) +
  geom_point(size = 2) + geom_jitter() +
  theme(legend.position="top") +
  labs(title="Space Flight Time by Age of Current Astronauts",  subtitle = "Scatterplot",
       x = "Age in 2018", y = "Space Flight Time [days]", colour = "Gender") +
  scale_colour_brewer(palette = "Set1") 

ggMarginal(g3, type = "histogram", fill="transparent")

6. A Picture of Me

One of the current astronauts is German and took this great photo of Cologne. If you look closely you can find me somewhere there! Probably I was drinking a glass of Kölsch! thanks