2025-09-18

Dataset: Global Space Exploration Dataset (2000-2025)

  • This dataset is from Kaggle.com

  • The author is Atharva Soundankar

  • Includes information from 10 countries, including the USA, China, and Russia

  • There are many columns with interesting information, but we will use only 4:

    - Country
    - Success Rate (%)
    - Budget (in Billion $)
    - Year

Comparing Means

For the first plot, built with plotly, we compare the mean Success Rate (%) and the mean Budget (in Billion $) for each country. The code below computes each set of means.

library(dplyr)

# Mean success rate per country
SuccessRateMeans <- SpaceExploration %>%
  group_by(Country) %>%
  summarize(MeanSuccess = mean(`Success Rate (%)`)) %>%
  ungroup()

# Mean budget per country
BudgetMeans <- SpaceExploration %>%
  group_by(Country) %>%
  summarize(MeanBudget = mean(`Budget (in Billion $)`)) %>%
  ungroup()
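
Since both summaries group by the same column, they could also be computed in a single pipeline, which gives one data frame that is ready for plotting. The sketch below is only an illustration: `toy` is a made-up stand-in for `SpaceExploration` with invented values (only the column names come from the code above), and `CountryMeans` is a hypothetical name.

```r
library(dplyr)

# Hypothetical stand-in for SpaceExploration: invented values, real column names
toy <- data.frame(
  Country = c("A", "A", "B", "B"),
  `Success Rate (%)` = c(70, 80, 60, 90),
  `Budget (in Billion $)` = c(10, 20, 30, 50),
  check.names = FALSE  # keep the column names with spaces and symbols as-is
)

# One grouped summarize call returns both means per country
CountryMeans <- toy %>%
  group_by(Country) %>%
  summarize(
    MeanSuccess = mean(`Success Rate (%)`),
    MeanBudget  = mean(`Budget (in Billion $)`),
    .groups = "drop"  # same effect as a trailing ungroup()
  )

print(CountryMeans)
```

Keeping both means in one data frame avoids a separate join step before plotting one mean against the other.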

Tables of Means

Success Rate Means

Country   MeanSuccess
China     74.99068
France    75.46302
Germany   76.24800
India     75.76531
Israel    74.10863
Japan     74.45172
Russia    75.25260
UAE       74.73443
UK        75.04348
USA       74.04276

Budget Means

Country   MeanBudget
China     25.65711
France    25.75058
Germany   24.12720
India     26.70636
Israel    25.44594
Japan     25.17486
Russia    25.68744
UAE       24.57738
UK        24.93084
USA       26.05418

Plot of Mean Budget v Mean Success Rate by Country Using Plotly

Adding Regression Line

Regression Line Discussion

Looking at the plot, most of the countries follow a common trend down the middle: in general, higher mean budgets go with higher mean success rates. Germany is a major outlier, with the lowest mean budget and the highest mean success rate. This pulls the regression line away from the apparent trend, so the fitted slope comes out negative.

\[
\begin{aligned}
\text{Model: } & \text{Mean Success} = \beta_0 + \beta_1 \cdot \text{Mean Budget} + \epsilon \\
\text{Fitted: } & \widehat{\text{Mean Success}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Mean Budget} \\
& \hat{\beta}_0 = 78.6008 \\
& \hat{\beta}_1 = -0.1413
\end{aligned}
\]
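
The fit for all ten countries can be reproduced from the country means in the tables with base R's `lm()`. This is a minimal sketch rather than the original analysis code: the data frame `CountryMeans` and the object `fit` are illustrative names.

```r
# Country means copied from the "Tables of Means" section
CountryMeans <- data.frame(
  Country = c("China", "France", "Germany", "India", "Israel",
              "Japan", "Russia", "UAE", "UK", "USA"),
  MeanSuccess = c(74.99068, 75.46302, 76.24800, 75.76531, 74.10863,
                  74.45172, 75.25260, 74.73443, 75.04348, 74.04276),
  MeanBudget = c(25.65711, 25.75058, 24.12720, 26.70636, 25.44594,
                 25.17486, 25.68744, 24.57738, 24.93084, 26.05418)
)

# Simple linear regression of mean success rate on mean budget
fit <- lm(MeanSuccess ~ MeanBudget, data = CountryMeans)

# Intercept and slope, rounded to four decimals
print(round(coef(fit), 4))
```

With these inputs the coefficients should agree with the estimates reported for the ten-country fit, up to rounding in the tabulated means.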

ggplot Removing Outlier Germany

ggplot Minus Germany Discussion

As anticipated in the plotly regression line discussion, removing the outlier Germany produces a regression line that falls in line with what you would expect from looking at the plot. In fact, almost all of the points fall within the 95% confidence band. The estimates of \(\beta_0\) and \(\beta_1\) both change substantially when Germany is removed, and the slope changes sign from -0.1413 to 0.5348:

\[
\begin{aligned}
\text{Model (without Germany): } & \text{Mean Success} = \beta_0 + \beta_1 \cdot \text{Mean Budget} + \epsilon \\
\text{Fitted (without Germany): } & \widehat{\text{Mean Success}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Mean Budget} \\
& \hat{\beta}_0 = 61.3427 \\
& \hat{\beta}_1 = 0.5348
\end{aligned}
\]

Plotting Mean Budget v Mean Success by Year, using ggplot

As with the plotly plot, we first compute the mean Success Rate and mean Budget, this time grouped by Year instead of Country.

# Mean budget per year
YearBudget <- SpaceExploration %>%
  group_by(Year) %>%
  summarize(YearBudgetMean = mean(`Budget (in Billion $)`)) %>%
  ungroup()

# Mean success rate per year
YearSuccess <- SpaceExploration %>%
  group_by(Year) %>%
  summarize(YearSuccessMean = mean(`Success Rate (%)`)) %>%
  ungroup()

Plot of Mean Budget v Mean Success, by Year

Adding Regression Line

Budget v Success by Year Discussion

The first scatter plot is not very helpful on its own. The yearly points are scattered widely in both mean budget and mean success rate, and it is hard to see what the trend may be. Once the regression line is added in the second plot, the general trend appears to be that years with higher mean budgets have lower mean success rates. However, roughly half of the points appear to fall outside of the confidence band.

\[
\begin{aligned}
\text{Model: } & \text{Mean Success} = \beta_0 + \beta_1 \cdot \text{Mean Budget} + \epsilon \\
\text{Fitted: } & \widehat{\text{Mean Success}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Mean Budget} \\
& \hat{\beta}_0 = 84.329 \\
& \hat{\beta}_1 = -0.3677
\end{aligned}
\]

Final Discussion

Comparing means can be a helpful way to see trends in data. As we saw, a single outlier can skew the fitted trend, so it is important to look for outliers before coming to a conclusion. It may also be important to find out why a point is an outlier before drawing any conclusions.

We also saw that a comparison of means can seem unhelpful at first, with the points in the scatter plot looking like they are all over the place. Adding a regression line shows the trend that the data suggest, even when the scatter plot alone doesn’t seem to tell us anything. When so much of the data lies outside of the confidence band, though, it is wise to be cautious about making any predictions from the fit.
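
One point worth checking before reading too much into points that sit outside a shaded band: assuming the bands in the plots are the default ones drawn around a fitted line (as with `geom_smooth(method = "lm")`), they are confidence bands for the mean response, not prediction intervals for individual points. The sketch below uses the country means from the tables and base R's `predict()` to compare the two kinds of interval. The object names (`CountryMeans`, `fit`, `conf_band`, `pred_band`, `inside`) are illustrative, not from the original analysis.

```r
# Country means copied from the "Tables of Means" section
CountryMeans <- data.frame(
  Country = c("China", "France", "Germany", "India", "Israel",
              "Japan", "Russia", "UAE", "UK", "USA"),
  MeanSuccess = c(74.99068, 75.46302, 76.24800, 75.76531, 74.10863,
                  74.45172, 75.25260, 74.73443, 75.04348, 74.04276),
  MeanBudget = c(25.65711, 25.75058, 24.12720, 26.70636, 25.44594,
                 25.17486, 25.68744, 24.57738, 24.93084, 26.05418)
)

fit <- lm(MeanSuccess ~ MeanBudget, data = CountryMeans)

# 95% band for the fitted mean line
conf_band <- predict(fit, interval = "confidence", level = 0.95)

# 95% band for a new observation; predict() warns that intervals on the
# fitting data refer to future responses, which is what we want here
pred_band <- suppressWarnings(
  predict(fit, interval = "prediction", level = 0.95)
)

# Count observed points that fall inside a band
inside <- function(band, y) sum(y >= band[, "lwr"] & y <= band[, "upr"])

cat("Inside confidence band:", inside(conf_band, CountryMeans$MeanSuccess), "of 10\n")
cat("Inside prediction band:", inside(pred_band, CountryMeans$MeanSuccess), "of 10\n")
```

The prediction band is always the wider of the two, so points outside the confidence band are not by themselves evidence of a poor fit. The spread of the residuals is a better guide to how far predictions can be trusted.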