Preliminary Discussion

Like many others, I have always had a keen interest in space and the universe in general. Learning about and observing space offers more than scientific knowledge; it also places our own humanity in context. Humans are instinctively curious, and space, with all its mysteries and fantastical aura, is a playground for that curiosity.

A space mission is, as you would probably expect, a journey into space by a manned or unmanned vehicle for a specific reason, normally to gather scientific data. Common aims include studying Earth, the other planets of the solar system, and other bodies in space.

When thinking about space missions, people often think of NASA, or of the USA and Russia as the principal players in the aerospace field. Whilst this holds true in certain respects, there are in fact many other players in the industry. Moreover, there are many fast-growing companies in the private space sector. In our analysis we shall look at both countries and companies, evaluating their successes, missions and cost efficiency.


Aim and Goal of Project

In this piece we analyse space missions from 1957 all the way to 2020. We approach the project by first cleaning and manipulating the data: identifying data errors and, where possible, correcting them, as well as sorting and arranging the data so that it is usable. We then explore the data set and (hopefully) pick up on interesting insights within the field of space travel. Specifically, we look at the countries and companies where space missions occur, as well as the missions, rockets and financial aspects of launches.


Data Description

The data set was scraped from Next Space Flight and, as mentioned, includes all the space missions since the beginning of the Space Race (1957). Some key events in the context of space missions related to the data set at hand are outlined below:

Date        Event
04-10-1957  first artificial Earth satellite
03-11-1957  first animal launched into space
12-04-1961  first human to orbit Earth
16-06-1963  first woman in space
20-07-1969  first human to walk on the Moon
02-11-2000  first resident crew to occupy the International Space Station
13-06-2010  first spacecraft to return to Earth with samples from an asteroid
12-11-2014  first spacecraft to land on a comet
01-01-2019  farthest object (2014 MU69) explored by a spacecraft
03-01-2019  first landing on the Moon’s far side

The data set is not complex and contains simple measures regarding the space missions and rockets launched by countries and companies. The variables in the data set:


Setup

The libraries used in this analysis are listed below.

library(dplyr)
library(tidyverse)
library(tidytext)
library(patchwork)
library(lubridate)
library(plotly)
library(RColorBrewer)
library(viridis)
library(ggrepel)
library(naniar)
library(scales)
library(janitor)
library(DT)
library(ggpubr)

We read in the data below.

#Read in 
space_miss <- read_csv("~/pavansingho23@gmail.com - Google Drive/My Drive/Portfolio/Projects/R/Space Missions/Space_Corrected.csv") %>% 
  clean_names()

Data Cleaning

An extract of the data set is shown below, after changing the column names of a few variables.

#Clean
space_miss <- space_miss[,-1]
names(space_miss)[1] <- "index"
names(space_miss)[7] <- "cost"
DT::datatable(head(space_miss,7))

Having read in the data and changed a few column headings, we shall examine the structure and conduct appropriate cleaning of the variables.

#Structure
str(space_miss)
#Change Structures
space_miss$status_mission <- as.factor(space_miss$status_mission)
space_miss$status_rocket <- as.factor(space_miss$status_rocket)

#Change date variable
space_miss <- space_miss %>% 
  mutate(launch_date = as_date(parse_date_time(datum, c("mdy HM", "mdy"), tz = "UTC")))
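
As a small illustration of the date handling above, the sketch below runs the same kind of parse_date_time() call on two made-up strings, one with a time of day and one without. The sample strings are assumptions for illustration only and are not rows from the data set.

```r
library(lubridate)

# Hypothetical date strings: one with a time of day, one date-only
datum_example <- c("Aug 07, 2020 05:12", "Dec 06, 1957")

# Try "month day year hour:minute" first, then fall back to "month day year"
parsed <- parse_date_time(datum_example, orders = c("mdy HM", "mdy"), tz = "UTC")

# Drop the time of day and keep only the calendar date
launch_date_example <- as_date(parsed)
print(launch_date_example)
```

Supplying both orders lets a single call handle rows that do and do not record a launch time.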

We see the encoding for the categorical variables status_rocket and status_mission below. This would be of greater importance if we were conducting a regression.

contrasts(space_miss$status_rocket)
##               StatusRetired
## StatusActive              0
## StatusRetired             1
contrasts(space_miss$status_mission)
##                   Partial Failure Prelaunch Failure Success
## Failure                         0                 0       0
## Partial Failure                 1                 0       0
## Prelaunch Failure               0                 1       0
## Success                         0                 0       1
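
To make the encoding above concrete, here is a minimal sketch on a toy factor (the values are made up for illustration). With R's default treatment contrasts, the first level acts as the baseline and every other level gets its own 0/1 indicator column.

```r
# Toy factor; levels are sorted alphabetically, so "Failure" becomes the baseline
status <- factor(c("Failure", "Success", "Partial Failure", "Success"))

# Default treatment contrasts: one indicator column per non-baseline level
enc <- contrasts(status)
print(enc)
```

The baseline row is all zeros, which matches the Failure row in the output above.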

Let's look at a summary of all our variables. We also want to see whether any variables have missing observations.

summary(space_miss) #summary of variables
##      index      company_name         location            datum          
##  Min.   :   0   Length:4324        Length:4324        Length:4324       
##  1st Qu.:1081   Class :character   Class :character   Class :character  
##  Median :2162   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2162                                                           
##  3rd Qu.:3242                                                           
##  Max.   :4323                                                           
##                                                                         
##     detail                status_rocket       cost       
##  Length:4324        StatusActive : 790   Min.   :   5.3  
##  Class :character   StatusRetired:3534   1st Qu.:  40.0  
##  Mode  :character                        Median :  62.0  
##                                          Mean   : 153.8  
##                                          3rd Qu.: 164.0  
##                                          Max.   :5000.0  
##                                          NA's   :3360    
##            status_mission  launch_date        
##  Failure          : 339   Min.   :1957-10-04  
##  Partial Failure  : 102   1st Qu.:1972-04-19  
##  Prelaunch Failure:   4   Median :1984-12-16  
##  Success          :3879   Mean   :1987-11-28  
##                           3rd Qu.:2002-09-10  
##                           Max.   :2020-08-07  
## 
sapply(space_miss, function(x) sum(is.na(x))) #number of missing values
##          index   company_name       location          datum         detail 
##              0              0              0              0              0 
##  status_rocket           cost status_mission    launch_date 
##              0           3360              0              0

Around \(77.7\%\) of our data points (3360 of 4324 missions) have missing cost values. For now we proceed without imputing or removing the data points with a missing mission cost.
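
The share of missing cost values follows directly from the counts in the output above. A minimal sketch of the arithmetic, using only those two numbers:

```r
n_total   <- 4324  # number of missions (rows), from the summary above
n_missing <- 3360  # number of NA values in cost, from the output above

# Share of missions without a recorded cost, as a percentage
share_missing <- 100 * n_missing / n_total
print(round(share_missing, 1))

# On the full data frame, the same share could be computed as:
# mean(is.na(space_miss$cost)) * 100
```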

Next, we don’t explicitly have a country variable in our data set; however, we can create one. The location variable describes the launch site and ends with the country. We can extract that last word (the country) from each location and store it in a new variable.

#Create country variable
space_miss <- space_miss %>% 
  mutate(country =  word(location,-1)) #word function here with -1 will return last word in sentence (country)
length(unique(space_miss$country)) #20 different countries, but some strange names
## [1] 20

First we see that we have 20 different countries, but some of them have strange names (sites and not countries).
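
The strange names arise because word() splits on spaces rather than commas. The sketch below shows this on two location strings; the first is a hypothetical example, the second is one of the locations handled further down. The comma-based alternative is an assumption added for comparison and is not the approach used in this analysis.

```r
library(stringr)

location_example <- c(
  "LC-39A, Kennedy Space Center, Florida, USA",         # hypothetical example string
  "LP Odyssey, Kiritimati Launch Area, Pacific Ocean"   # location that ends in a two-word name
)

# word(x, -1) returns the last space-separated word
last_word <- word(location_example, -1)
print(last_word)

# Alternative sketch: keep everything after the last comma instead
last_field <- str_trim(str_extract(location_example, "[^,]+$"))
print(last_field)
```

Taking the last comma-separated field keeps multi-word names such as "Pacific Ocean" intact, although such values would still need to be mapped to a country by hand.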

#Changing these weird locations - we identify these locations and change them to new
space_miss %>% dplyr::select(country, location) %>%
    filter(country %in% c("Ocean", "Sea", "Facility", "Site")) %>%
  nrow() #42 weirdly classified locations
## [1] 42

In total, 42 launches have a misclassified location: their extracted country name is Ocean, Sea, Facility or Site. This is clearly a mistake. Below we see these rows in an interactive table.

datatable(
  space_miss %>% dplyr::select(country, location) %>%
    filter(country %in% c("Ocean", "Sea", "Facility", "Site")))

The table above shows the odd country names that resulted from extracting the last word of the location, for example Site or Ocean. We will adjust these manually to specify the actual country.

space_miss <-
  space_miss %>% mutate(
    country = case_when(
      location == "LP Odyssey, Kiritimati Launch Area, Pacific Ocean" ~ "Pacific Ocean",
      location == "LP-41, Kauai, Pacific Missile Range Facility" ~ "Range Facility",
      location == "K-84 Submarine, Barents Sea Launch Area, Barents Sea" |
        # OR
        location == "K-496 Submarine, Barents Sea Launch Area, Barents Sea" |
        # OR
        location == "K-407 Submarine, Barents Sea Launch Area, Barents Sea" ~ "Barents Sea",
      location == "Tai Rui Barge, Yellow Sea" ~ "Yellow Sea",
      location == "Launch Plateform, Shahrud Missile Test Site" ~ "Shahrud Missile Test Site",
      location == "Rocket Lab LC-1A, M?hia Peninsula, New Zealand" ~ "New Zealand",
      TRUE ~  word(location, -1)
    )
  )

space_miss <- space_miss %>% 
 mutate(
    country = str_replace(country, "StatusRetired", replacement = "USA"),
    country = str_replace(country, "Yellow Sea", replacement = "China"),
    country = str_replace(country, "Russia", replacement = "Russian Federation"),
    country = str_replace(country, "Shahrud Missile Test Site", replacement = "Iran"),
    country = str_replace(country, "Range Facility", replacement = "USA"),
    country = str_replace(country, "Barents Sea", replacement = "Russian Federation"),
    country = str_replace(country, "Canaria", replacement = "USA")
  )

After making these changes, let's have a look at some initial results, sorting countries from most to fewest space missions.

#Some final summary of results
space_miss %>% count(country, sort =T) %>% head(5) #Russia most missions
## # A tibble: 5 × 2
##   country                n
##   <chr>              <int>
## 1 Russian Federation  1398
## 2 USA                 1347
## 3 Kazakhstan           701
## 4 France               303
## 5 China                269
space_miss %>% count(country, sort =T) %>% tail(5) #Brazil has the fewest missions
## # A tibble: 5 × 2
##   country       n
##   <chr>     <int>
## 1 Kenya         9
## 2 Korea         8
## 3 Australia     6
## 4 Mexico        4
## 5 Brazil        3

The resulting output is shown above, where we see the top 5 and bottom 5 countries for the number of launches conducted.

In summary, we created a country variable for where each launch happened. Some extracted names (“Ocean”, “Sea”, “Facility”, “Site”) were not countries and needed to be fixed. We first identified the affected rows and inspected their full location names. We then used case_when to give these locations an intermediate label, and finally replaced those labels with the actual country to which each location belongs.
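
The two-step recoding summarised above can also be expressed as a single lookup table. The sketch below is an alternative under stated assumptions, not the method used in this analysis: the names fixes and country_example are hypothetical, and the toy vector stands in for the country column.

```r
# Hypothetical lookup: extracted label -> country (values mirror the replacements above)
fixes <- c(
  "Yellow Sea"  = "China",
  "Barents Sea" = "Russian Federation",
  "Russia"      = "Russian Federation"
)

# Toy input standing in for space_miss$country
country_example <- c("USA", "Yellow Sea", "Russia", "Barents Sea", "France")

# Replace exact matches only; values not in the lookup are left unchanged
recoded <- ifelse(
  country_example %in% names(fixes),
  unname(fixes[country_example]),
  country_example
)
print(recoded)
```

Exact matching is the main design difference from str_replace, which matches substrings: running a "Russia" to "Russian Federation" replacement a second time would also alter values that already read "Russian Federation".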

Before we end this section, we show an interactive table with the cleaned data and new variable country, below.

datatable(
  space_miss %>% dplyr::select(-index, -location, -datum))

Data Exploration

We have split this analysis into two sections. The first concerns the countries and companies within the context of launches and mission statuses, as well as their launch history. The second deals with the financial aspect of the space missions with respect to the companies and countries in the data set.

Part 1 - Country Analysis

We begin by looking at the number of active and retired rockets in circulation for each country in our data set.

#Active and Nonactive Rockets for Each Country 
ggplot(space_miss) +
 aes(x = country, fill = status_rocket) +
 geom_bar() +
 scale_fill_hue(direction = 1) +
 theme_minimal()

The USA, Russia and Kazakhstan are the top three by number of rockets launched, the majority of which are retired. China is one of the few countries with more active than retired rockets; New Zealand and Mexico also fall into that group. Russia has launched the most rockets, but only a few of them are still active.

We removed the former USSR launch countries, Kazakhstan and the Russian Federation, to further analyse the companies and the number of launches in each remaining country.

#Without USSR
space_miss %>%
 filter(!(country %in% c("Kazakhstan", "Russian Federation"))) %>%
 ggplot() +
 aes(x = company_name, fill = country) +
 geom_bar() +
 scale_fill_manual(values = c(Australia = "#F8766D", 
Brazil = "#E48432", China = "#FFC83D", France = "#ACA000", India = "#7FAC07", Iran = "#000000", Israel = "#00BB4C", 
Japan = "#51FFC8", Kazakhstan = "#00BEB1", Kenya = "#3B5F32", Korea = "#20AFEC", Mexico = "#549FFB", 
`New Zealand` = "#C9C6FF", `Pacific Ocean` = "#46135C", `Russian Federation` = "#FF06EA", USA = "#9A738B"
)) +
 coord_flip() +
 theme_minimal()

This provides a clear breakdown of who the biggest players in terms of rocket launches might be. We see that Arianespace, from France, has the most launches, with well over 250. Interestingly, NASA does not have the most launches in the USA; that title belongs to General Dynamics.

Let’s compute the top 10’s for country first, and then companies. That is, the top 10 countries/companies by number of launches.

# Function for Top Tens
mycolplot <- function(data, target, colour, num) {
  
  data %>% 
  mutate(target2 = fct_lump({{target}}, num)) %>%
  group_by(target2) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  ungroup() %>%
  ggplot(aes(
    x = fct_reorder(target2, Count),
    y = Count,
    label = Count
  )) +
  geom_col(fill =  colour) +
  geom_text(aes(
    hjust = case_when(Count >= median(Count) ~ 1.2,      #text inside (bar) if number launches above median
                      Count < median(Count) ~ -0.1),     #text outside (bar) if number launches below median
    col = case_when(Count >= median(Count) ~ "white",    #text white if greater than median launches
                    Count < median(Count) ~ "black")     #text black if lesser than median launches
  ), size = 3) +
  coord_flip() +
  scale_colour_identity() +
  theme(legend.position = "none")
}
#Top 10 Countries
mycolplot(space_miss, country, "#91003f", 10) +
  labs(
    title = "Top 10 Countries by Launches",
    caption = "A descending list of the top 10 countries by number of launches",
    subtitle = "Space Missions from 1957",
    x = "Country name",
    y = "Number of launches"
  )

The resulting figure above shows that Russia has the most launches, closely followed by the USA. The other countries making the list are shown as well. Let’s now look at the top 10 companies by launch count.
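
The helper above relies on fct_lump() to keep only the most frequent levels and collapse the rest into a single "Other" level. A minimal sketch on toy data (the letters are made up for illustration):

```r
library(forcats)

# Toy data: A appears 3 times, B twice, C and D once each
x <- factor(c("A", "A", "A", "B", "B", "C", "D"))

# Keep the 2 most frequent levels and lump everything else into "Other"
lumped <- fct_lump(x, n = 2)

print(table(lumped))
```

This is why an "Other" bar can appear in the top-10 plots: it holds the combined count of every level outside the top 10.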

#Top 10 Companies
mycolplot(space_miss, company_name, "#00aedb", 10) +
  labs(
    title = "Top 10 Companies",
    caption = "A descending list of the top 10 companies by number of launches",
    subtitle = "Space Missions from 1957",
    x = "Company name",
    y = "Number of launches"
  ) 

We can clearly see that RVSN USSR leads the pack by a wide margin, having conducted more than double the number of launches of its closest competitor (the combined “Other” group, which includes several companies) and more than \(6\times\) the number of launches of the third-placed Arianespace. We can remove RVSN USSR from the sample and see what the distribution looks like without it.

space_miss %>% filter(!company_name %in% c("RVSN USSR", "Other")) %>% 
mycolplot(.,company_name, "#f37735", 15)+
  labs(
    title = "Top 15 Companies without RVSN USSR!",
    caption = "A descending list of the top 15 companies without USSR",
    subtitle = "Space Missions from 1957",
    x = "Company name",
    y = "Number of launches"
  )

This gives a clearer look at the other companies operating in this field. NASA, as we have already established, surprisingly does not conduct the most launches: CASC, General Dynamics and Arianespace have all conducted more.

General Dynamics Corporation is an American publicly traded aerospace and defense corporation headquartered in Reston, Virginia. Within the aerospace field, they design, build and manage ground-based systems that enable communications and control of satellite networks and spacecraft exploring the unknown. However, they are also a household name in defense systems. From Gulfstream business jets and combat vehicles to nuclear-powered submarines and communications systems, people around the world depend on General Dynamics products and services for their safety and security.

The China Aerospace Science and Technology Corporation, or CASC, is the main contractor for the Chinese space program. It is state-owned and has subsidiaries which design, develop and manufacture a range of spacecraft, launch vehicles, strategic and tactical missile systems, and ground equipment. It was officially established in July 1999 as part of a Chinese government reform drive, having previously been one part of the former China Aerospace Corporation.

Arianespace SA is a French company that undertakes the operation and marketing of the Ariane program. Established on March 26, 1980, it was the world’s first commercial launch service provider.

#Time series analysis - number of launches from X to Y years for several companies


space_miss %>% dplyr::select(company_name, launch_date, country) %>% 
  mutate(company_name = fct_lump(company_name, 11)) %>%
  filter(company_name != "Other") %>%                                          # Remove "other" company from results
  mutate(launch_year = year(launch_date)) %>%
  mutate(comp_country_loc = paste(company_name, country, sep = " - ")) %>%     # add variable company - country for plotting
  group_by(comp_country_loc) %>%
  mutate(
    start_year = min(launch_year),                                             # first launch year per company - country
    end_year = max(launch_year)                                                # last launch year per company - country
  ) %>%
  ungroup() %>%
  count(comp_country_loc, launch_year, start_year, end_year, name = "Count") %>%   # data frame with company, count of launches in each year
  filter(!comp_country_loc %in% c("CASC - Yellow Sea",
                                  "Arianespace - Kazakhstan")) %>% 
arrange(desc(Count)) %>%
  ggplot(aes(x = launch_year, y = Count)) +
  geom_point(aes(col = comp_country_loc), size = 1) +
  geom_line(aes(col = comp_country_loc), size = 1.3) +
  geom_text(aes(
    x = start_year+2,
    y = max(Count-5),
    label = paste("From", start_year)
  ), size = 4) +
  geom_text(aes(
    x = end_year-2,
    y = max(Count-5),
    label = paste("To", end_year)
  ), size = 4) +
  facet_wrap(vars(comp_country_loc),
             scales = "free_x",
             ncol = 3) +
  theme(axis.text.x = element_text(
    angle = 90,
    vjust = 0.5,
    hjust = 1
  )) +
  theme(legend.position = "none") +
  scale_x_continuous(expand = expansion(add = c(8, 8))) +
  scale_color_viridis(discrete = TRUE) +
  labs(
    title = "Country, Company, # of launches time series",
    subtitle = "Line plot, Country, Company, # of launches time series",
    caption = "Kaggle: All Space Missions from 1957",
    x = "Launch year",
    y = "Number of launches"
  )

The plot above shows the number of launches per year for several companies in our data set, together with the country in which each company launches. Essentially, we have a time series profile of launch counts for each company. RVSN USSR conducted far more launches between 1970 and 1990 than any other organisation; in fact, no other organisation shows a trend or magnitude comparable to it. Notably, CASC from China appears to have gained considerable traction recently, given the rapid rise in its number of launches in the years leading up to 2020.
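
The per-year profile described above boils down to counting launches per company and year while carrying along each company's first and last launch year. A minimal sketch on toy data, where the companies and years are hypothetical:

```r
library(dplyr)

# Toy launches: hypothetical companies and years, not taken from the data set
launches <- tibble(
  company     = c("X", "X", "X", "Y", "Y"),
  launch_year = c(2018, 2018, 2020, 2019, 2019)
)

per_year <- launches %>%
  group_by(company) %>%
  mutate(start_year = min(launch_year),   # first year with a launch
         end_year   = max(launch_year)) %>%   # last year with a launch
  ungroup() %>%
  count(company, launch_year, start_year, end_year, name = "Count")

print(per_year)
```

Each row of per_year is one point on a company's line, and start_year and end_year supply the "From" and "To" labels.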

We shall now make use of the plotly package to generate some interactive plots. These can give a clearer picture and let the user reveal more information by simply hovering over data points. Let's begin by looking at the launches per year, highlighting the accompanying mission statuses.

ggplotly(
  space_miss %>% dplyr::select(launch_date, status_mission) %>%
    mutate(launch_year = year(launch_date)) %>%
    group_by(launch_year, status_mission) %>%
    summarise(Count = n()) %>%
    arrange(desc(Count)) %>%
    ggplot(aes(
      x = launch_year, y = Count, fill = status_mission
    )) +
    geom_col() +
    scale_x_continuous(breaks = seq(
      from = 1956, to = 2020, by = 4
    )) +
    scale_fill_brewer(palette = "Set1") + 
    theme(axis.text.x = element_text(
      angle = 90,
      vjust = 0.5,
      hjust = 1
    )) +
    theme(axis.line = element_line(colour = "darkblue",
                                   size = 2)) +
    labs(
      title = "Launches by year",
      subtitle = "Launches by year by Mission Status",
      caption = "All Space Missions from 1957",
      x = "",
      y = "",
      fill = ""
    )
)

Most failures occurred early on, between 1957 and 1980. In more recent times failures still occur, but less regularly than in the past, and understandably so: technology and effectiveness improve with time.

ggplotly(
  space_miss %>% dplyr::select(launch_date, country) %>%
    mutate(country = fct_lump(country,5)) %>% 
    mutate(launch_year = year(launch_date)) %>%
    group_by(country, launch_year) %>%
    summarise(Count = n()) %>%
    arrange(desc(Count)) %>%
    ggplot(aes(
      x = launch_year, y = Count, fill = country
    )) +
    geom_col(position = "fill") +
    scale_x_continuous(breaks = seq(
      from = 1956, to = 2020, by = 4
    )) +
    scale_fill_brewer(palette = "Set1") + 
    theme(axis.text.x = element_text(
      angle = 90,
      vjust = 0.5,
      hjust = 1
    )) +
    theme(axis.line = element_line(colour = "darkblue",
                                   size = 2)) +
    labs(
      title = "launches by year by country (leading)",
      subtitle = "Column plot, launches by year by country (leading)",
      caption = "Kaggle: All Space Missions from 1957",
      x = "",
      y = "",
      fill = ""
    )
)

There are four clear comments that can be made.

  • In the early years, from 1957 to about 1964, the USA conducted the most launches by some margin. It is worth noting that in 1957 Kazakhstan did actually conduct more launches than the USA. Russia at that time had not conducted any launches. Their first launch occurred in 1963.

  • From 1964 to around 1980, Kazakhstan increased their launches per year and actually outperformed the USA on this metric. At the same time, Russia rapidly increased its number of space missions. They had considerably more launches per year between 1968 and 1990 than the USA and Kazakhstan.

  • From 1988 the USA began to launch a lot more rockets, and this led to them holding the title of most launches from 1988 to around 2016.

  • China slowly started conducting launches from 1969 and steadily increased their annual number of launches. In more recent times they have grown significantly and have conducted more launches annually than France, Kazakhstan and Russia for the past couple of years.
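
The comparisons above are about each country's share of the launches in a given year, which is what position = "fill" draws. A minimal sketch of the underlying calculation, using hypothetical counts:

```r
library(dplyr)

# Toy counts per year and country (hypothetical numbers)
counts <- tibble(
  launch_year = c(2019, 2019, 2020, 2020),
  country     = c("A", "B", "A", "B"),
  Count       = c(30, 10, 20, 20)
)

# Share of each country within its year; shares sum to 1 per year
shares <- counts %>%
  group_by(launch_year) %>%
  mutate(share = Count / sum(Count)) %>%
  ungroup()

print(shares)
```

Because every column is rescaled to the same height, the plot shows relative standing per year and hides changes in the total number of launches.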

space_miss %>% dplyr::select(launch_date, country) %>% 
  mutate(country = fct_lump(country, 5)) %>%
  mutate(launch_month = month(launch_date, label = TRUE),
         launch_year = year(launch_date)) %>% 
  filter(launch_year == 2020) %>% 
  group_by(country, launch_month) %>% 
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>% 
  ggplot(aes(x = launch_month, y = Count, label = Count)) +
  geom_col(aes(fill = launch_month)) + 
  geom_text(vjust =1.2 , col = "white", size = 6) +
  scale_fill_viridis(discrete = TRUE) +
  facet_grid(vars(country)) +
  theme(legend.position = "none") +
  labs(
    title = "The Year 2020 and Countries' Launches per Month",
    subtitle = "Countries and the number of launches per month in 2020",
    caption = "All Space Missions from 1957",
    x = "Launch Month",
    y = "Number of launches"
  )

Trying to identify a trend in the data, we see that for the USA launches happen regularly and appear to follow a distinct pattern, although April appears to have been a low month for launches in 2020. For China, most launches in 2020 occurred between May and July. France conducted only two launches in 2020, early in the year, in January and February respectively. Note that the data set ends on 2020-08-07, so the 2020 figures cover only part of the year.

g2 <- space_miss %>% dplyr::select(status_mission) %>% 
  count(status_mission, sort = T) %>%
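  mutate(prop = n / sum(n) * 100,                        # numeric share, so bar length is proportional
         prop_label = paste0(round(prop, 2), "%")) %>%   # formatted text used only for the label
  ggplot(aes(x = status_mission, y = prop)) +
  geom_bar(aes(fill = status_mission),
           stat = "identity",
           position = "stack"
           ) +
  coord_flip() +
  scale_fill_manual(values = c('#e41a1c', '#377eb8', "#fdae61", "#1a9641")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +   # room for labels past the bar ends
  geom_text(
    aes(y = prop, label = prop_label),
    hjust = -0.1,                                        # place the label just outside the bar
    size = 4,
    fontface = "bold"
  ) +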
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  ) +
  theme(legend.position = "none") +
  labs(
    title = "Missions Status",
    subtitle = "Bar plot, Missions Status",
    caption = "Kaggle: All Space Missions from 1957",
    fill = "",
    x = "Missions Status",
    y = "Proportions"
  )

g1 <- space_miss %>% dplyr::select(status_rocket) %>% 
  mutate(
    status_rocket = str_replace(status_rocket, "StatusRetired", replacement = "Retired"),
    status_rocket = str_replace(status_rocket, "StatusActive", replacement = "Active")
  ) %>% 
  count(status_rocket, sort = T) %>%
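  mutate(prop = n / sum(n) * 100,                        # numeric share, so bar length is proportional
         prop_label = paste0(round(prop, 2), "%")) %>%   # formatted text used only for the label
  ggplot(aes(x = status_rocket, y = prop)) +
  geom_bar(aes(fill = status_rocket),
           stat = "identity",
           position = "stack",
           width = 1) +
  coord_flip() +
  scale_fill_manual(values = c("#1a9641", "#d7191c")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +   # room for labels past the bar ends
  geom_text(
    aes(y = prop, label = prop_label),
    hjust = -0.1,                                        # place the label just outside the bar
    size = 4,
    fontface = "bold"
  ) +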
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  ) +
  theme(legend.position = "none") +
  labs(
    title = "Rocket Status",
    caption = "Rocket statuses from all space missions from 1957",
    fill = "",
    x = "Rocket Status",
    y = "Proportions"
  )

ggarrange(g1, g2, nrow = 2)

Of all the rockets launched, around \(81.7\%\) have been retired. Most missions (around \(89.7\%\)) were successful, with around \(7.8\%\) classified as failures. Only \(0.09\%\) of missions in the data set from 1957 have recorded a pre-launch failure - New Zealand.
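
These percentages can be reproduced from the counts in the summary output shown earlier. A minimal sketch of the arithmetic, using only those counts:

```r
# Counts copied from the summary(space_miss) output
status_rocket  <- c(StatusActive = 790, StatusRetired = 3534)
status_mission <- c("Failure" = 339, "Partial Failure" = 102,
                    "Prelaunch Failure" = 4, "Success" = 3879)

# Percentage share of each category, rounded to two decimals
rocket_pct  <- round(100 * status_rocket / sum(status_rocket), 2)
mission_pct <- round(100 * status_mission / sum(status_mission), 2)

print(rocket_pct)
print(mission_pct)
```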

Part 2 - Financial Aspect

We begin by analyzing the cost aspect for each country.

#Cost Distribution
ggplot(space_miss) +
 aes(x = cost, fill = country) +
 geom_histogram(bins = 30L) +
 scale_fill_hue(direction = 1) +
 theme_minimal()

The USA accounts for by far the majority of the most expensive space missions. New Zealand's space mission is the most expensive of all the countries, and by a very significant margin.

We shall remove some space missions which cost over 250 million, to obtain a clearer view of the other result.

#Cost Distribution with filters (closer look)
space_miss %>%
 filter(cost >= 0L & cost <= 250L | is.na(cost)) %>%
 ggplot() +
 aes(x = cost, fill = country) +
 geom_histogram(bins = 30L) +
 scale_fill_hue(direction = 1) +
 theme_minimal()

#distribution of Launches per Company

We filtered the results to achieve a more detailed view of the costs per country. We see that France has some pretty expensive missions, as does the USA; however, their distribution is a lot more spread out. What is important to note when analyzing the cost of the missions is that we do not know the objectives of the missions, how far they traveled or what they did. These are all pertinent questions and have significant ramifications for the resources required, and therefore for cost.
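
One detail of the filter used above is the `| is.na(cost)` clause. In dplyr, filter() drops rows where the condition evaluates to NA, so missions without a cost would otherwise be removed. A minimal sketch on toy values (hypothetical costs in millions of USD):

```r
library(dplyr)

# Toy costs: one cheap mission, one expensive mission, one with unknown cost
costs <- tibble(cost = c(40, 300, NA))

# Without the is.na() clause, the NA row is dropped along with the expensive one
kept_without_na <- costs %>% filter(cost >= 0 & cost <= 250)

# With the clause, the NA row is kept
kept_with_na <- costs %>% filter(cost >= 0 & cost <= 250 | is.na(cost))

print(nrow(kept_without_na))
print(nrow(kept_with_na))
```

For a histogram of cost the retained NA rows are not drawn anyway, so the clause mainly matters if the filtered data is reused for something else.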

Let’s have a look at the average cost of mission per country.

#Average Cost of Mission per Country
space_miss %>% dplyr::select(cost, country) %>%
  filter(!is.na(cost)) %>%
  group_by(country) %>%
  summarise(
    cost_mean = mean(cost),          # average mission cost per country
    sdiv = sd(cost),                 # standard deviation of mission cost
    Count = n()                      # number of missions with a known cost
  ) %>%
  filter(Count > 13) %>%
  arrange(desc(Count)) %>%
  ggplot(aes(x = fct_reorder(country, - cost_mean), y = cost_mean, label = round(cost_mean,2))) +
  geom_col(aes(fill = country)) +
  geom_text(
    vjust = 1.5, 
    size = 4,
    col = "white",
    fontface = "bold"
  ) +
  scale_fill_manual(values = c(
    '#e41a1c',
    '#377eb8',
    '#4daf4a',
    '#984ea3',
    '#ff7f00',
    '#e6ab02',
    '#a65628'
  )) +
  theme(legend.position = "none") +
   theme(axis.text.x = element_text(
    angle = 90,
    vjust = 0.5,
    hjust = 1
  )) +
  labs(
    title = "Average Cost of Mission by Country",
    subtitle = "Column plot",
    caption = "All Space Missions from 1957",
    fill = "",
    x = "Country",
    y = "mln USD"
  )

Interestingly, although the USA and Russia had the highest number of launches, Kazakhstan has the highest average mission cost, followed by the USA. Russia has a relatively low average mission cost of around 40 million USD. The visualization only includes countries with more than 13 space missions of known cost. Of these, India had the lowest average mission cost.

Despite having only the second highest average mission cost, the USA has conducted a lot more missions than Kazakhstan and many other countries. As such, it is no surprise to see below that the USA has a total cost of over 100 billion USD on space missions. This is more than 6 times that of the next country (France at 16.3 billion USD). Russia, having conducted the most space missions, in fact has a very low (relative) total cost of only 2.2 billion.
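
As a quick check of the comparison above, the sketch below divides the two totals quoted in the text. The USA figure is only the stated lower bound ("over 100 billion"), so the result is a lower bound on the ratio.

```r
usa_total_bn    <- 100    # lower bound quoted in the text, billions of USD
france_total_bn <- 16.3   # quoted in the text, billions of USD

# How many times larger the USA total is than the French total
ratio <- usa_total_bn / france_total_bn
print(round(ratio, 1))

# Costs in the data are in millions of USD, so totals in billions are sum(cost) / 1000
```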

space_miss %>% dplyr::select(country, cost) %>%
  mutate(country = fct_lump(country, 25)) %>%
  filter(!is.na(cost)) %>%
  group_by(country) %>%
  summarise(Total_spent = sum(cost)/1000) %>%
  ungroup() %>%
  arrange(desc(Total_spent)) %>%
  ggplot(aes(x = fct_reorder(country, Total_spent), y = Total_spent, label = paste0(round(Total_spent,1),"B"))) +
  geom_col(fill = "Steelblue") +
  geom_text(hjust = -0.2) +
  coord_flip() +
  scale_y_continuous(labels = unit_format(unit = "B"), expand = expansion(add = c(0, 13))) +
  labs(
    title = "Total Cost by Country",
    subtitle = "Bar plot, Total Cost by Country",
    caption = "Kaggle: All Space Missions from 1957",
    fill = "",
    x = "Country Name",
    y = "Total"
  )

This is a good indication of the cost effectiveness of Russia (USSR). Let's look at the violin plot of mission cost, in millions of dollars, for each country.

space_miss %>% dplyr::select(cost, country) %>% 
  filter(!is.na(cost)) %>% 
    ggplot(aes(x = country, y = cost)) +
    geom_violin(trim = T, fill = "steelblue") +
  geom_jitter()+
  theme(axis.text.x = element_text(
    angle = 90,
    vjust = 0.5,
    hjust = 1
  ))+
  scale_y_log10(labels = dollar)+
  labs(
    title = "Cost of the mission: in $ million by Countries",
    subtitle = "Violin plot with jittered points",
    caption = "All Space Missions from 1957",
    fill = "",
    x = "Country",
    y = "Cost by mln USD"
  )

Examining the plot above, we can inspect the distribution of mission cost for each country. A violin plot is a hybrid of a box plot and a kernel density plot, which shows peaks in the data; essentially, it lets us examine the density of the cost distribution for each country. The USA has a very large range, with most missions costing between about 50 and 100 million USD. The Russian Federation could have a very stringent budget or be very cost effective, as most of its missions fall in a much smaller range. An interesting side thought is that, as time progresses from 1957, it could be reasonable to assume that the cost of missions should decrease as countries become more productive and more effective at spending. However, there is little evidence so far to suggest this.
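
The plot above uses a log10 y-axis, which is a sensible choice given how widely the costs vary. A minimal sketch of that spread, using the minimum and maximum cost from the summary output shown earlier:

```r
cost_min <- 5.3    # minimum cost in millions of USD, from the summary output
cost_max <- 5000   # maximum cost in millions of USD, from the summary output

# Number of powers of ten between the cheapest and the most expensive mission
orders_of_magnitude <- log10(cost_max / cost_min)
print(round(orders_of_magnitude, 2))
```

With costs spanning roughly three orders of magnitude, a linear axis would compress almost all missions into a thin band at the bottom of the plot.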

space_miss %>% dplyr::select(cost, company_name) %>% 
  filter(!is.na(cost)) %>% 
  ggplot(aes(x = company_name, y = cost)) +
  geom_violin(trim = TRUE, fill = "steelblue") +
  geom_boxplot(width = 0.5, fill = '#e41a1c') +
  geom_jitter() +
  theme(axis.text.x = element_text(
    angle = 90,
    vjust = 0.5,
    hjust = 1
  ))+
  scale_y_log10(labels = dollar) +
  labs(
    title = "Cost of the mission: in $ million by Company",
    subtitle = "Violin and Box Plot",
    caption = "Kaggle: All Space Missions from 1957",
    fill = "",
    x = "Company",
    y = "Cost by mln USD"
  )

If we look at the individual companies, we see that the majority have a broadly similar box plot, with a median cost somewhere between 50 and 60 million USD. NASA has a significantly higher cost than most other companies; this could be a consequence of it having a bigger budget and being able to conduct more advanced or more resource-demanding space missions.
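The medians read off the box plots can also be tabulated. The following is a minimal sketch on invented costs (the company names are real, the numbers are not taken from the data set), using the same `company_name` and `cost` columns as the plot.

```r
library(dplyr)

# Toy stand-in for space_miss; costs (millions of USD) are invented for illustration.
toy_miss <- data.frame(
  company_name = c("NASA", "NASA", "NASA", "SpaceX", "SpaceX", "SpaceX", "ULA", "ULA"),
  cost         = c(450, 1160, 450, 50, 62, NA, 109, 153)
)

cost_summary <- toy_miss %>%
  filter(!is.na(cost)) %>%
  group_by(company_name) %>%
  summarise(
    n_priced    = n(),                                   # launches with a known cost
    median_cost = median(cost),                          # the line inside each box
    q25         = quantile(cost, 0.25, names = FALSE),   # lower edge of the box
    q75         = quantile(cost, 0.75, names = FALSE),   # upper edge of the box
    .groups = "drop"
  ) %>%
  arrange(desc(median_cost))

print(cost_summary)
```

Sorting by `median_cost` makes it easy to see which companies sit well above the typical range and how many priced launches each median rests on.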

We can examine the total cost for each company as well, shown below.

space_miss %>% dplyr::select(company_name, cost) %>%
  mutate(company_name = fct_lump(company_name, 25)) %>%
  filter(!is.na(cost)) %>%
  group_by(company_name) %>%
  summarise(Total_spent = sum(cost)/1000) %>%
  ungroup() %>%
  arrange(desc(Total_spent)) %>%
  ggplot(aes(x = fct_reorder(company_name, Total_spent), y = Total_spent, label = paste0(round(Total_spent,2),"B"))) +
  geom_col(fill = "Steelblue") +
  geom_text(hjust = -0.2) +
  coord_flip() +
  scale_y_continuous(labels = unit_format(unit = "B"), expand = expansion(add = c(0, 10))) +
  labs(
    title = "Total Cost by company",
    subtitle = "Bar Plot, Total Cost by company",
    caption = "Kaggle: All Space Missions from 1957",
    fill = "",
    x = "Company Name",
    y = "Total"
  )

NASA has a significantly higher total cost than any other company, at over 70 billion USD. Second is Arianespace with “only” 16.34 billion.

ULA is third, with 14.8 billion. United Launch Alliance (ULA) is an American spacecraft launch service provider that manufactures and operates a number of rocket vehicles capable of launching spacecraft into orbits around Earth and to other bodies in the Solar System. A joint venture between Lockheed Martin Space and Boeing Defense, Space & Security, it was formed in December 2006 and is headquartered in Denver, Colorado. ULA’s rockets are among the largest and most powerful in the industry.

# Total cost (in billions of USD) of missions that were not a full success, by country
g2 <- space_miss %>% dplyr::select(country, status_mission, cost) %>%
  drop_na() %>%   # drops rows with a missing country, status or cost
  filter(status_mission != "Success") %>%
  group_by(country) %>%
  summarise(Total_loss = sum(cost)/1000) %>%
  ungroup() %>%
  ggplot(aes(x = fct_reorder(country, Total_loss), y = Total_loss, label = paste0(round(Total_loss,2),"B"))) +
  geom_col(fill = "Steelblue") +
  geom_text(hjust = -0.2) +
  coord_flip() +
  scale_y_continuous(labels = unit_format(unit = "B"), expand = expansion(add = c(0, 0.3))) +
  labs(
    title = "Cost of mission failure",
    subtitle = "Bar Plot",
    caption = "All Space Missions from 1957",
    fill = "",
    x = "Country Name",
    y = "Total losses"
  )

# Total cost (in billions of USD) of successful missions, by country
g1 <- space_miss %>% dplyr::select(country, status_mission, cost) %>%
  drop_na() %>%   # drops rows with a missing country, status or cost
  filter(status_mission == "Success") %>%
  group_by(country) %>%
  summarise(Total_spent = sum(cost)/1000) %>%
  ungroup() %>%
  ggplot(aes(x = fct_reorder(country, Total_spent), y = Total_spent, label = paste0(round(Total_spent,2),"B"))) +
  geom_col(fill = "Steelblue") +
  geom_text(hjust = -0.2) +
  coord_flip() +
  scale_y_continuous(labels = unit_format(unit = "B"), expand = expansion(add = c(0, 15))) +
  labs(
    title = "Cost of successful missions",
    subtitle = "Bar Plot",
    caption = "All Space Missions from 1957",
    fill = "",
    x = "Country Name",
    y = "Total"
  )

ggarrange(g1, g2, nrow =2)

The USA has the highest cost of mission failure by some margin, at over 3.5 billion USD. France, with the second-highest cost of mission failure, has a total loss of only 440 million USD (0.44 billion).
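Absolute losses favour countries that spend little, so it can help to express the failure cost as a share of each country's total priced spend. The sketch below is one possible way to do this, shown on invented figures rather than the real data; it reuses the `country`, `status_mission` and `cost` columns and the same `!= "Success"` rule as the chart, while the status labels other than "Success" are placeholders.

```r
library(dplyr)

# Toy stand-in for space_miss; costs (millions of USD) and outcomes are invented.
toy_miss <- data.frame(
  country        = c("USA", "USA", "USA", "USA", "France", "France", "France"),
  status_mission = c("Success", "Failure", "Success", "Partial Failure",
                     "Success", "Failure", "Success"),
  cost           = c(100, 50, 350, NA, 200, 80, 120)
)

loss_share <- toy_miss %>%
  filter(!is.na(cost)) %>%
  group_by(country) %>%
  summarise(
    total_spent = sum(cost) / 1000,                               # billions of USD
    total_loss  = sum(cost[status_mission != "Success"]) / 1000,  # billions of USD
    .groups = "drop"
  ) %>%
  mutate(loss_pct = 100 * total_loss / total_spent) %>%           # share of spend lost
  arrange(desc(loss_pct))

print(loss_share)
```

In this toy example the USA spends more in total, yet France loses the larger share of its spend, which is the kind of difference the absolute totals alone cannot show.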

space_miss %>% dplyr::select(country, status_mission, cost) %>%  
   filter(!is.na(cost)) %>% 
  group_by(country, status_mission) %>% 
  summarise(total_rocket = round(sum(cost/1000),1)) %>% 
  ungroup() %>% 
  filter(total_rocket>0.0) %>% 
  ggplot(aes(x = country, y = total_rocket, fill = status_mission, label = paste0(round(total_rocket,2),"B"))) +
  geom_col(position = "dodge") +
  geom_text(vjust = -0.2) +
  scale_y_continuous(labels = unit_format(unit = "B"), expand = expansion(add = c(0, 20))) +
  facet_wrap(vars(status_mission), scales = "free_y") +
  theme(legend.position = "top") +
  theme(axis.text.x = element_text(
    angle = 90,
    vjust = 0.5,
    hjust = 1
  )) +
  labs(
    title = "Spent VS Losses by Country",
    subtitle = "Bar Plot",
    caption = "All Space Missions from 1957",
    fill = "",
    x = "Country Name",
    y = "Total"
  )


References