Statistics in Space

2025-09-18

Dataset: Global Space Exploration Dataset (2000-2025)

This dataset is from Kaggle.com.
The author is Atharva Soundankar.
It includes information from 10 countries, including the USA, China, Russia, and more.
There are many columns with interesting information; we will be using only 4:
```
- Country
- Success Rate (%)
- Budget (in Billion $)
- Year
```

Comparing Means

For the first plot, built with plotly, we compare the mean Success Rate (%) and the mean Budget for each country. The code below computes both sets of means.

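library(dplyr)  # provides %>%, group_by(), summarize(), ungroup()

# Mean success rate (%) per country
SuccessRateMeans <- SpaceExploration %>%
  group_by(Country) %>%
  summarize(MeanSuccess = mean(`Success Rate (%)`)) %>%
  ungroup()

# Mean budget (in billions of $) per country
BudgetMeans <- SpaceExploration %>%
  group_by(Country) %>%
  summarize(MeanBudget = mean(`Budget (in Billion $)`)) %>%
  ungroup()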
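
A scatter plot of one mean against the other needs both values on the same row. As a minimal sketch of one way to get that (not necessarily how the original plot was built), both means can be computed in a single `summarize()` call. The `toy` data frame and the name `CountryMeans` are made up for illustration; only the column names come from the dataset.

```r
library(dplyr)

# Toy stand-in for SpaceExploration: made-up values, real column names.
toy <- data.frame(
  Country = c("A", "A", "B", "B"),
  `Success Rate (%)` = c(70, 80, 60, 90),
  `Budget (in Billion $)` = c(10, 20, 30, 50),
  check.names = FALSE  # keep the spaces and symbols in the column names
)

# One row per country, with both means side by side.
CountryMeans <- toy %>%
  group_by(Country) %>%
  summarize(
    MeanSuccess = mean(`Success Rate (%)`),
    MeanBudget  = mean(`Budget (in Billion $)`),
    .groups = "drop"  # return an ungrouped result
  )

print(CountryMeans)
```

Joining `SuccessRateMeans` and `BudgetMeans` on `Country` would give the same table; the single call just avoids the extra step.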

Tables of Means

Success Rate Means
Country	MeanSuccess
China	74.99068
France	75.46302
Germany	76.24800
India	75.76531
Israel	74.10863
Japan	74.45172
Russia	75.25260
UAE	74.73443
UK	75.04348
USA	74.04276

Budget Means
Country	MeanBudget
China	25.65711
France	25.75058
Germany	24.12720
India	26.70636
Israel	25.44594
Japan	25.17486
Russia	25.68744
UAE	24.57738
UK	24.93084
USA	26.05418

Plot of Mean Budget v Mean Success Rate by Country Using Plotly

Adding Regression Line

Regression Line Discussion

Most of the points follow a loose upward trend through the middle of the plot: in general, higher mean budgets go with higher mean success rates. Germany is a major outlier, with the lowest mean budget and the highest mean success rate. That single point pulls the regression line away from the apparent trend and gives it a negative slope.

\[
\begin{aligned}
\text{Model:}\quad & \text{Mean Success} = \beta_0 + \beta_1 \cdot \text{Mean Budget} + \epsilon \\
\text{Fitted:}\quad & \widehat{\text{Mean Success}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Mean Budget} \\
& \hat{\beta}_0 = 78.6008 \\
& \hat{\beta}_1 = -0.1413
\end{aligned}
\]
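
For reference, these estimates follow from the usual least-squares formulas. The worked version below takes $x$ as Mean Budget and $y$ as Mean Success, with sums computed from the ten tabulated country means. Values are rounded, so the last digit may differ slightly.

\[
\begin{aligned}
\hat{\beta}_1 &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \approx \frac{-0.7029}{4.9745} \approx -0.1413 \\
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \approx 75.0101 - (-0.1413)(25.4112) \approx 78.60
\end{aligned}
\]

Germany alone contributes about $-1.59$ to the numerator, while the other nine countries together contribute about $+0.89$. That is why this one point is enough to flip the sign of the slope.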

ggplot Removing Outlier Germany

ggplot Minus Germany Discussion

As predicted in the plotly regression line discussion, removing the outlier Germany produces a regression line that matches what you would expect from looking at the plot. In fact, almost all of the points fall within the 95% confidence band. The estimates of $\beta_0$ and $\beta_1$ both change substantially when Germany is removed:

\[
\begin{aligned}
\text{Model (without Germany):}\quad & \text{Mean Success} = \beta_0 + \beta_1 \cdot \text{Mean Budget} + \epsilon \\
\text{Fitted (without Germany):}\quad & \widehat{\text{Mean Success}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Mean Budget} \\
& \hat{\beta}_0 = 61.3427 \\
& \hat{\beta}_1 = 0.5348
\end{aligned}
\]
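
The fits can be checked directly from the tabulated country means. The sketch below uses base R only; `fit_without` is a hypothetical helper name, and the data frame is copied from the two tables of means rather than from the raw dataset. On these means, dropping only Germany gives a slope of roughly 0.34 and an intercept of roughly 66.2, while the quoted 0.5348 and 61.3427 appear when the USA is dropped as well. It may therefore be worth checking which rows the original filter excluded.

```r
# Country-level means copied from the two tables of means.
means <- data.frame(
  Country = c("China", "France", "Germany", "India", "Israel",
              "Japan", "Russia", "UAE", "UK", "USA"),
  MeanSuccess = c(74.99068, 75.46302, 76.24800, 75.76531, 74.10863,
                  74.45172, 75.25260, 74.73443, 75.04348, 74.04276),
  MeanBudget = c(25.65711, 25.75058, 24.12720, 26.70636, 25.44594,
                 25.17486, 25.68744, 24.57738, 24.93084, 26.05418)
)

# Hypothetical helper: drop the listed countries, fit the simple linear
# model, and return the intercept and slope estimates.
fit_without <- function(drop = character(0)) {
  kept <- means[!(means$Country %in% drop), ]
  coef(lm(MeanSuccess ~ MeanBudget, data = kept))
}

all_countries  <- fit_without()
no_germany     <- fit_without("Germany")
no_germany_usa <- fit_without(c("Germany", "USA"))

# One row per fit: intercept first, then the slope on MeanBudget.
print(round(rbind(all_countries, no_germany, no_germany_usa), 4))
```

Either way the qualitative conclusion holds: with Germany included the slope is negative, and without it the slope turns positive.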

Plotting Mean Budget v Mean Success by Year, using ggplot

As with the plotly plot, we first computed the mean Success Rate and mean Budget, this time grouping by Year instead of Country.

# Mean budget (in billions of $) per year
YearBudget <- SpaceExploration %>%
  group_by(Year) %>%
  summarize(YearBudgetMean = mean(`Budget (in Billion $)`)) %>%
  ungroup()

# Mean success rate (%) per year
YearSuccess <- SpaceExploration %>%
  group_by(Year) %>%
  summarize(YearSuccessMean = mean(`Success Rate (%)`)) %>%
  ungroup()

Plot of Mean Budget v Mean Success, by Year

Adding Regression Line

Budget V Success by Year Discussion

The first scatter plot is not very helpful on its own. The yearly points are scattered with no obvious pattern between budget and success rate, so it is hard to see what the trend may be. With the regression line added in the second plot, the general trend appears to be that years with higher mean budgets have lower mean success rates. However, roughly half of the points appear to fall outside the confidence band.

\[
\begin{aligned}
\text{Model:}\quad & \text{Mean Success} = \beta_0 + \beta_1 \cdot \text{Mean Budget} + \epsilon \\
\text{Fitted:}\quad & \widehat{\text{Mean Success}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Mean Budget} \\
& \hat{\beta}_0 = 84.329 \\
& \hat{\beta}_1 = -0.3677
\end{aligned}
\]
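
When reading points against a shaded band, it helps to know which band is drawn. A confidence band describes uncertainty in the fitted line itself, so individual points often fall outside it. A prediction band describes where new observations are expected and is always wider. The sketch below compares the two using base R's `predict()`. The data are simulated (26 made-up points with an arbitrary noise level), not the yearly means from the dataset, and `inside` is a hypothetical helper name.

```r
set.seed(1)  # simulated data for illustration, not the space dataset

n <- 26
sim <- data.frame(x = runif(n, min = 20, max = 30))
sim$y <- 84 - 0.37 * sim$x + rnorm(n, sd = 2)

fit <- lm(y ~ x, data = sim)

# 95% band for the fitted mean vs. 95% band for individual observations.
conf_band <- predict(fit, newdata = sim, interval = "confidence", level = 0.95)
pred_band <- predict(fit, newdata = sim, interval = "prediction", level = 0.95)

# Hypothetical helper: share of observed points lying inside a band.
inside <- function(band) {
  mean(sim$y >= band[, "lwr"] & sim$y <= band[, "upr"])
}

share_in_conf <- inside(conf_band)
share_in_pred <- inside(pred_band)

cat("Share of points inside confidence band:", share_in_conf, "\n")
cat("Share of points inside prediction band:", share_in_pred, "\n")
```

Many points outside a confidence band is therefore not unusual by itself. The prediction band is the one to consult when judging how far a new observation might fall from the line.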

Final Discussion

Comparing means can be a helpful way to see trends in data. As we saw, outliers can skew the fit, so it is important to look for them, and ideally to understand why they occur, before drawing any conclusions.

Additionally, we saw that a scatter plot on its own can seem rather unhelpful, with the points looking like they are all over the place. Adding a regression line shows the trend that the data suggest, even when the scatter plot does not seem to tell us anything. When so many of the points lie outside the confidence band, it is wise to be cautious about making any predictions from the data.