Winning big at the Olympics brings national prestige and an increased presence on the world stage for a country. This leads to a country to increase its image and popularity, in turn could cause increases in tourism a potential bid to host the Olympics, and other strong political relationship. Therefore, it may be useful to predict how many medals a country would win. What factors might impact a country’s success?

Filipino gymnast, Carlos Yulo, made headlines as the first male to win an Olympic gold medal in the history of the Philippines. But he continued to grow in popularity as his earnings from the country caught attention. He earned a cash prize of 10 million Philippine pesos ($172,519 USD), a brand new condo valued at over $400,000 USD, a lifetime supply of ramen and other incentives. In comparison, the US awards its gold medalists a “measly” $38,000 for winning the gold. While governments in Great Britain and Sweden do not offer any direct cash incentives. In this unit we will explore data from the 2024 Paris Olympics, how cash incentives and national factors impact medal counts for each country. We will explore the countries with the top 24 medal counts.

Step 1: Collect the Data

# Load in the data from a csv
olympics2024<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/olympicsclean2024.csv")
# Produce the first 6 rows of the data
head(olympics2024)
# Create a histogram of the response
hist(olympics2024$WeightedTotal, xlab="Weighted Medal Count", main="Histogram of Weighted Medal Count") 

# We can transform the response to reduce the skew
hist(log(olympics2024$WeightedTotal), xlab="Log of Weighted Medal Count", main="Histogram of log(Weighted Medal Count)") 

ADD YOUR NOTES ON THIS SHAPE:

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

We explore the scatter plots and correlations for each explanatory variable then classify the relationships as linear, curvilinear, or none.

# Generate the names and positions of the columns in the data
names(olympics2024)
 [1] "Country"         "WeightedTotal"   "Total"           "TGold"          
 [5] "TSilver"         "TBronze"         "GoldPrize"       "SilverPrize"    
 [9] "BronzePrize"     "YOGold"          "YOSilver"        "YOBronze"       
[13] "YOTotal"         "GDP"             "Population"      "WomanLaborForce"
[17] "TotalAthletes"   "Distance"       
# Create one scatter plot
# plot(x,y)
plot(olympics2024$GoldPrize, olympics2024$WeightedTotal,xlab="Gold Winning Cash Incentive",ylab="Weighted Total Medal Count")

plot(log(olympics2024$GoldPrize), log(olympics2024$WeightedTotal),xlab="Gold Winning Cash Incentive",ylab="Weighted Total Medal Count")

# Create several plots at once
# plot(y~x+x+...,data=DATATABLENAME)
plot(WeightedTotal~log(GoldPrize)+SilverPrize,data=olympics2024)

# Create your plots in a grid
# the par() functions allows us to put plots on the same grid
# don't worry too much about syntax
# mfrow=c(ROWS, COLUMNS), pin=size of plot in inches
# mar = margins of plotting area
par(mfrow=c(2,4),pin=c(1,1),mar=c(7.1,4.1,3,1))
# Use a loop to make a lot of plots at once
for (i in names(olympics2024)[7:18]) {
  plot(olympics2024[,i], olympics2024$WeightedTotal,xlab=i,ylab="Weighted Total Medal Count")
}

# compute correlation simply for one variable with response
cor(olympics2024$GoldPrize,olympics2024$WeightedTotal)
[1] 0.3516467
# compute correlations for all variables and round to three decimal places
round(cor(olympics2024[7:18],olympics2024$WeightedTotal),3)
                 [,1]
GoldPrize       0.352
SilverPrize     0.318
BronzePrize     0.230
YOGold          0.481
YOSilver        0.479
YOBronze        0.491
YOTotal         0.547
GDP             0.897
Population      0.634
WomanLaborForce 0.209
TotalAthletes   0.682
Distance        0.217

ADD YOUR NOTES ON THE RELATIONSHIPS WITH UNTRANSFORMED RESPONSE:

# Should we use log(weightedtotal)?
par(mfrow=c(2,4),pin=c(1,1),mar=c(7.1,4.1,3,1))
# Use a loop to make a lot of plots at once
for (i in names(olympics2024)[7:18]) {
  plot(olympics2024[,i], log(olympics2024$WeightedTotal),xlab=i,ylab="Log Weighted Total Medal Count")
}

# compute correlations for all variables and round to three decimal places
round(cor(olympics2024[7:18],log(olympics2024$WeightedTotal)),3)
                 [,1]
GoldPrize       0.260
SilverPrize     0.220
BronzePrize     0.139
YOGold          0.460
YOSilver        0.572
YOBronze        0.586
YOTotal         0.601
GDP             0.687
Population      0.488
WomanLaborForce 0.236
TotalAthletes   0.719
Distance        0.263

ADD YOUR NOTES ON THE RELATIONSHIPS WITH UNTRANSFORMED RESPONSE:

Should we go with transformed or un-transformed?

We will keep the un-transformed response because we still have variables with strong relationships.

\(WeightedTotal=\beta_0+\beta_1 GDP+\beta_2 TotalAthletes+\beta_3 Distance+\beta_4 Distance^2+\epsilon\)

Step 3: Estimate the model parameters (fit the model using R)

# store the model as an object
# lm(RESPONSE~x1+...,data=DATAFRAME)
olympicsmod1<-lm(WeightedTotal~GDP+TotalAthletes+Distance +I(Distance ^2),data=olympics2024)
summary(olympicsmod1)

Call:
lm(formula = WeightedTotal ~ GDP + TotalAthletes + Distance + 
    I(Distance^2), data = olympics2024)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.003 -15.854  -6.391  12.713  48.558 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.125e+01  1.599e+01   0.704   0.4902    
GDP           8.145e-03  1.240e-03   6.566 2.76e-06 ***
TotalAthletes 1.250e-01  4.483e-02   2.787   0.0117 *  
Distance      1.535e-03  6.145e-03   0.250   0.8055    
I(Distance^2) 9.884e-08  5.777e-07   0.171   0.8660    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.05 on 19 degrees of freedom
Multiple R-squared:  0.8761,    Adjusted R-squared:   0.85 
F-statistic: 33.58 on 4 and 19 DF,  p-value: 2.262e-08
# Extract just the coefficients and round them 4 decimal places
round(coef(olympicsmod1),4)
  (Intercept)           GDP TotalAthletes      Distance I(Distance^2) 
      11.2538        0.0081        0.1250        0.0015        0.0000 
# Compute MSE from estimated standard error
sigma(olympicsmod1)^2
[1] 731.8691

The prediction equation is:

\(\widehat{WeightedTotal}=11.25+0.0081GDP+0.125TotalAthletes+0.0015Distance+.0000Distance^2\)

Interpretation of coefficients:

Step 4: Specify the distribution of the errors and find the estimate of the variance

Step 5: Evaluate the Utility of the model

\(WeightedTotal=\beta_0+\beta_1 GDP+\beta_2 TotalAthletes+\beta_3 Distance+\beta_4 Distance^2+\epsilon\)

First we Perform the Global F Test:

  • Hypotheses:
    • \(H_0: \beta_1= \beta_2=\beta_3=\beta_4=0\) (the model is not adequate)
    • \(H_a\):at least one of \(\beta_1 , \beta_2 , \beta_3,\beta_4\) does not equal 0 (the model is adequate)
  • Distribution of test statistic: F with 4, 19 DF
  • Test Statistic: F=33.58
  • Pvalue: <0.0001
  • Decision: 0.0001<0.05 -> REJECT H0
  • Conclusion: The model with GDP, Total Athletes, Distance and Distance squared is adequate at predicting the weighted medal count.

Then we test “the most important predictors”: Test the Individual Significance of Distance^2

  • Hypotheses:
    • \(H_0: \beta_4=0\) (the quadratic relationship does not contribute to predicting the weighted medal count)
    • \(H_a:\beta_4 \neq 0\) (the quadratic relationship contributes to predicting the weighted medal count)
  • Distribution of test statistic: T with 19 DF
  • Test Statistic: t=0.171
  • Pvalue: 0.866
  • Decision: 0.866>0.05 -> FAIL TO REJECT H0
  • Conclusion: The quadratic relationship of distance does not contribute information for predicting the weighted medal count for a country. We will remove just the higher order term and refit the model.

Refit the model

Refitting the model can change significance of other variables and will always change the beta estimates. Here the estimates changed only slightly because the term we removed was not significant.

olympicsmod2<-lm(WeightedTotal~GDP+TotalAthletes+Distance,data=olympics2024)
summary(olympicsmod2)

Call:
lm(formula = WeightedTotal ~ GDP + TotalAthletes + Distance, 
    data = olympics2024)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.431 -16.397  -6.266  12.581  48.975 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    9.531074  12.119446   0.786  0.44084    
GDP            0.008041   0.001056   7.616 2.47e-07 ***
TotalAthletes  0.128013   0.040113   3.191  0.00459 ** 
Distance       0.002538   0.001786   1.421  0.17063    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.39 on 20 degrees of freedom
Multiple R-squared:  0.8759,    Adjusted R-squared:  0.8573 
F-statistic: 47.05 on 3 and 20 DF,  p-value: 3.023e-09
round(coef(olympicsmod2),4)
  (Intercept)           GDP TotalAthletes      Distance 
       9.5311        0.0080        0.1280        0.0025 
round(confint(olympicsmod2),4)
                 2.5 %  97.5 %
(Intercept)   -15.7496 34.8118
GDP             0.0058  0.0102
TotalAthletes   0.0443  0.2117
Distance       -0.0012  0.0063

\(\widehat{WeightedTotal}=9.531+0.008GDP+0.128+0.0025Distance\)

We see from the global F test this is significant, however Distance is not significant in the model. We will remove Distance and refit the model again.

olympicsmod3<-lm(WeightedTotal~GDP+TotalAthletes,data=olympics2024)
summary(olympicsmod3)

Call:
lm(formula = WeightedTotal ~ GDP + TotalAthletes, data = olympics2024)

Residuals:
    Min      1Q  Median      3Q     Max 
-41.077 -20.727  -1.128  16.394  45.443 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   18.369036  10.652346   1.724  0.09932 .  
GDP            0.008313   0.001063   7.817 1.19e-07 ***
TotalAthletes  0.121908   0.040840   2.985  0.00706 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.02 on 21 degrees of freedom
Multiple R-squared:  0.8633,    Adjusted R-squared:  0.8503 
F-statistic: 66.34 on 2 and 21 DF,  p-value: 8.393e-10
round(coef(olympicsmod3),4)
  (Intercept)           GDP TotalAthletes 
      18.3690        0.0083        0.1219 
round(confint(olympicsmod3),4)
                2.5 %  97.5 %
(Intercept)   -3.7837 40.5218
GDP            0.0061  0.0105
TotalAthletes  0.0370  0.2068

Further Assessment:

  • Root MSE: 27.02 (slightly better than the model with higher order term)
  • Adjusted R-Sq: 0.8503 (better than model with higher order term, and it is closer to R2)
    • After accounting for the number of predictors and observations, 85% of the variation in weighted medal count is explained by the model with GDP and total athletes.
  • Confidence Interval for Betas: We are 95% confident that for each billion dollar increase in GDP, the true the weighted medal count increases by between 0.006 and 0.0105, while the number of total athletes remains constant.

Step 6: Check the Model Assumptions

We will cover this in Unit 3

Step 7: Use the model for prediction or estimation

Back to our questions- What factors might impact a country’s success?

From our simple assessment we determined GDP and number of athletes contribute to a country’s weighted medal count. We did not determine the cash incentives contributed. Although, we could have attempted further transformations to find a strong relationship.

When can we use this model?

We should only use this model to predict for countries within the data set and for the summer Olympics. Keep in mind also that the number of events can change across each games. If we predict for countries well beyond the scope of our explanatory variables (or response), that is called extrapolation.

How do we use the model?

Let’s compare how this model performed for the US medal count (a high value) and Spain’s medal count (a middle value).

olympicsmod3<-lm(WeightedTotal~GDP+TotalAthletes,data=olympics2024)
round(coef(olympicsmod3),4)
  (Intercept)           GDP TotalAthletes 
      18.3690        0.0083        0.1219 
# The long way to get y-hat
# US
18.3690+0.0083*27360.935+0.1219*593 
[1] 317.7515
# Spain
18.3690+0.0083*1580.694713+0.1219*382
[1] 78.05457
# A much easier and exact way
# Spain is observation 20 and the US is observation 23 in the data table
# we can pull values from the model object
olympicsmod3$fitted.values
        1         2         3         4         5         6         7         8 
 88.77574  43.73900  70.20640  74.68151 201.39789  36.72361 113.29469 107.46504 
        9        10        11        12        13        14        15        16 
 85.99667  40.73704  26.46091  83.43788 103.37101  27.79535  61.30981  44.24828 
       17        18        19        20        21        22        23        24 
 51.20042  34.33086  34.55722  78.07741  37.56406  36.92241 318.09806  29.60872 
olympicsmod3$fitted.values[c(20, 23)]
       20        23 
 78.07741 318.09806 
# or we can use the predict function
predict(olympicsmod3)
        1         2         3         4         5         6         7         8 
 88.77574  43.73900  70.20640  74.68151 201.39789  36.72361 113.29469 107.46504 
        9        10        11        12        13        14        15        16 
 85.99667  40.73704  26.46091  83.43788 103.37101  27.79535  61.30981  44.24828 
       17        18        19        20        21        22        23        24 
 51.20042  34.33086  34.55722  78.07741  37.56406  36.92241 318.09806  29.60872 
predict(olympicsmod3)[c(20, 23)]
       20        23 
 78.07741 318.09806 
predict(olympicsmod3,interval="prediction")[c(20, 23),]
         fit       lwr      upr
20  78.07741  19.27739 136.8774
23 318.09806 245.08655 391.1096
# Or we can create a data frame with the new values.
# the explanatory variable names must match those in the model, but there is no response variable.
newolymp<-data.frame(GDP=c(27360.935,1580.694713),TotalAthletes=c(593,382))
newolymp
predict(olympicsmod3,newolymp, interval="prediction")
        fit       lwr      upr
1 318.09835 245.08680 391.1099
2  78.07736  19.27734 136.8774
# to compute confidence intervals for estimation we would use interval="confidence"