Winning big at the Olympics brings national prestige and an increased presence on the world stage for a country. This leads to a country to increase its image and popularity, in turn could cause increases in tourism a potential bid to host the Olympics, and other strong political relationship. Therefore, it may be useful to predict how many medals a country would win. What factors might impact a country’s success?

Filipino gymnast, Carlos Yulo, made headlines as the first male to win an Olympic gold medal in the history of the Philippines. But he continued to grow in popularity as his earnings from the country caught attention. He earned a cash prize of 10 million Philippine pesos ($172,519 USD), a brand new condo valued at over $400,000 USD, a lifetime supply of ramen and other incentives. In comparison, the US awards its gold medalists a “measly” $38,000 for winning the gold. While governments in Great Britain and Sweden do not offer any direct cash incentives. In this unit we will explore data from the 2024 Paris Olympics, how cash incentives and national factors impact medal counts for each country. We will explore the countries with the top 24 medal counts.

Country: Team Name
WeightedTotal: Weighted medal count from Paris 2024 (4 points for each gold, 2 for each silver, 1 for bronze)
Total: Non-weighted total medal count for a country from Paris 2024
TGold: Total gold medal count for a country from Paris 2024
TSilver: Total gold medal count for a country from Paris 2024
TBronze: Total gold medal count for a country from Paris 2024
GoldPrize: Government cash incentive for a gold medal (in USD)
SilverPrize : Government cash incentive for a silver medal (in USD)
BronzePrize : Government cash incentive for a bronze medal (in USD)
YOGold: Total gold medal count for a country from Buenos Aires Youth Olympics in 2018
YOSilver: Total silver medal count for a country from Buenos Aires Youth Olympics in 2018
YOBronze: Total bronze medal count for a country from Buenos Aires Youth Olympics in 2018
YOTotal: Total medal count for a country from Buenos Aires Youth Olympics in 2018
GDP: country’s gross domestic product in 2023 (measured in billions of USD)
Population: country’s population in 2023 (measured in millions)
WomanLaborForce: Percentage of woman working in the country in 2023
TotalAthletes: Total number of athletes participation in Paris 2024 Olympics
Distance: Average distance from country to Paris (rounded to 100)

Step 1: Collect the Data

The data were compiled from several sources including ESPN, Olympic International Committee, and World Bank. Some values were cross references on Wikipedia. We have no reason to question the credibility of these sources.
Are the data organized for MLR?- each row contains information about one observation (experimental unit) including all of the explanatory predictors and the response.
Check the appropriateness of response variable for regression: View a histogram of response variable. Primarily, it should be continuous with few outliers. Data should not be recorded over time. It is sometimes useful if the response variable is unimodal and symmetric, but that is not a requirement. (Further, we will not use “rank” as a response variable in MLR. We need a continuous variable with several values of each response, if the variable is discrete.)

# Load in the data from a csv
olympics2024<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/olympicsclean2024.csv")
# Produce the first 6 rows of the data
head(olympics2024)

# Create a histogram of the response
hist(olympics2024$WeightedTotal, xlab="Weighted Medal Count", main="Histogram of Weighted Medal Count")

# We can transform the response to reduce the skew
hist(log(olympics2024$WeightedTotal), xlab="Log of Weighted Medal Count", main="Histogram of log(Weighted Medal Count)")

ADD YOUR NOTES ON THIS SHAPE:

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

We explore the scatter plots and correlations for each explanatory variable then classify the relationships as linear, curvilinear, or none.

# Generate the names and positions of the columns in the data
names(olympics2024)

 [1] "Country"         "WeightedTotal"   "Total"           "TGold"          
 [5] "TSilver"         "TBronze"         "GoldPrize"       "SilverPrize"    
 [9] "BronzePrize"     "YOGold"          "YOSilver"        "YOBronze"       
[13] "YOTotal"         "GDP"             "Population"      "WomanLaborForce"
[17] "TotalAthletes"   "Distance"

# Create one scatter plot
# plot(x,y)
plot(olympics2024$GoldPrize, olympics2024$WeightedTotal,xlab="Gold Winning Cash Incentive",ylab="Weighted Total Medal Count")

plot(log(olympics2024$GoldPrize), log(olympics2024$WeightedTotal),xlab="Gold Winning Cash Incentive",ylab="Weighted Total Medal Count")

# Create several plots at once
# plot(y~x+x+...,data=DATATABLENAME)
plot(WeightedTotal~log(GoldPrize)+SilverPrize,data=olympics2024)

# Create your plots in a grid
# the par() functions allows us to put plots on the same grid
# don't worry too much about syntax
# mfrow=c(ROWS, COLUMNS), pin=size of plot in inches
# mar = margins of plotting area
par(mfrow=c(2,4),pin=c(1,1),mar=c(7.1,4.1,3,1))
# Use a loop to make a lot of plots at once
for (i in names(olympics2024)[7:18]) {
  plot(olympics2024[,i], olympics2024$WeightedTotal,xlab=i,ylab="Weighted Total Medal Count")
}

# compute correlation simply for one variable with response
cor(olympics2024$GoldPrize,olympics2024$WeightedTotal)

[1] 0.3516467

# compute correlations for all variables and round to three decimal places
round(cor(olympics2024[7:18],olympics2024$WeightedTotal),3)

                 [,1]
GoldPrize       0.352
SilverPrize     0.318
BronzePrize     0.230
YOGold          0.481
YOSilver        0.479
YOBronze        0.491
YOTotal         0.547
GDP             0.897
Population      0.634
WomanLaborForce 0.209
TotalAthletes   0.682
Distance        0.217

ADD YOUR NOTES ON THE RELATIONSHIPS WITH UNTRANSFORMED RESPONSE:

# Should we use log(weightedtotal)?
par(mfrow=c(2,4),pin=c(1,1),mar=c(7.1,4.1,3,1))
# Use a loop to make a lot of plots at once
for (i in names(olympics2024)[7:18]) {
  plot(olympics2024[,i], log(olympics2024$WeightedTotal),xlab=i,ylab="Log Weighted Total Medal Count")
}

# compute correlations for all variables and round to three decimal places
round(cor(olympics2024[7:18],log(olympics2024$WeightedTotal)),3)

                 [,1]
GoldPrize       0.260
SilverPrize     0.220
BronzePrize     0.139
YOGold          0.460
YOSilver        0.572
YOBronze        0.586
YOTotal         0.601
GDP             0.687
Population      0.488
WomanLaborForce 0.236
TotalAthletes   0.719
Distance        0.263

ADD YOUR NOTES ON THE RELATIONSHIPS WITH UNTRANSFORMED RESPONSE:

Should we go with transformed or un-transformed?

We will keep the un-transformed response because we still have variables with strong relationships.

We see GDP (r=0.897) and Total Athletes (r=0.682) have the strongest relationships
Potentially, distance has a curvilinear relationship.
Therefore, our hypothesized model is:

$WeightedTotal=\beta_0+\beta_1 GDP+\beta_2 TotalAthletes+\beta_3 Distance+\beta_4 Distance^2+\epsilon$

Step 3: Estimate the model parameters (fit the model using R)

# store the model as an object
# lm(RESPONSE~x1+...,data=DATAFRAME)
olympicsmod1<-lm(WeightedTotal~GDP+TotalAthletes+Distance +I(Distance ^2),data=olympics2024)
summary(olympicsmod1)


Call:
lm(formula = WeightedTotal ~ GDP + TotalAthletes + Distance + 
    I(Distance^2), data = olympics2024)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.003 -15.854  -6.391  12.713  48.558 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.125e+01  1.599e+01   0.704   0.4902    
GDP           8.145e-03  1.240e-03   6.566 2.76e-06 ***
TotalAthletes 1.250e-01  4.483e-02   2.787   0.0117 *  
Distance      1.535e-03  6.145e-03   0.250   0.8055    
I(Distance^2) 9.884e-08  5.777e-07   0.171   0.8660    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.05 on 19 degrees of freedom
Multiple R-squared:  0.8761,    Adjusted R-squared:   0.85 
F-statistic: 33.58 on 4 and 19 DF,  p-value: 2.262e-08

# Extract just the coefficients and round them 4 decimal places
round(coef(olympicsmod1),4)

  (Intercept)           GDP TotalAthletes      Distance I(Distance^2) 
      11.2538        0.0081        0.1250        0.0015        0.0000

# Compute MSE from estimated standard error
sigma(olympicsmod1)^2

[1] 731.8691

The prediction equation is:

$\widehat{WeightedTotal}=11.25+0.0081GDP+0.125TotalAthletes+0.0015Distance+.0000Distance^2$

Interpretation of coefficients:

For example: For a billion dollar increase in GDP, the predicted weighted medal total is expected to increase by 0.0081, given the other explanatory variables are held constant.
The coefficients for the linear and curvilinear components of distance are not practical to interpret individually.

Step 4: Specify the distribution of the errors and find the estimate of the variance

$\epsilon \overset{\mathrm{iid}}{\sim} N(0,\sigma^2 )$
estimate of $\sigma$ is $\sqrt{MSE}=27.05$
estimate of $\sigma^2$ is $MSE=731.8691$
Approximately 95% of our predictions will be within 2*27.05=54.1 medals of the actual weighted medal total. Whether this is “good” depends greatly on the goals of prediction here.

Step 5: Evaluate the Utility of the model

$WeightedTotal=\beta_0+\beta_1 GDP+\beta_2 TotalAthletes+\beta_3 Distance+\beta_4 Distance^2+\epsilon$

First we Perform the Global F Test:

Hypotheses:
- $H_0: \beta_1= \beta_2=\beta_3=\beta_4=0$ (the model is not adequate)
- $H_a$:at least one of $\beta_1 , \beta_2 , \beta_3,\beta_4$ does not equal 0 (the model is adequate)
Distribution of test statistic: F with 4, 19 DF
Test Statistic: F=33.58
Pvalue: <0.0001
Decision: 0.0001<0.05 -> REJECT H0
Conclusion: The model with GDP, Total Athletes, Distance and Distance squared is adequate at predicting the weighted medal count.

Then we test “the most important predictors”: Test the Individual Significance of Distance^2

Hypotheses:
- $H_0: \beta_4=0$ (the quadratic relationship does not contribute to predicting the weighted medal count)
- $H_a:\beta_4 \neq 0$ (the quadratic relationship contributes to predicting the weighted medal count)
Distribution of test statistic: T with 19 DF
Test Statistic: t=0.171
Pvalue: 0.866
Decision: 0.866>0.05 -> FAIL TO REJECT H0
Conclusion: The quadratic relationship of distance does not contribute information for predicting the weighted medal count for a country. We will remove just the higher order term and refit the model.

Refit the model

Refitting the model can change significance of other variables and will always change the beta estimates. Here the estimates changed only slightly because the term we removed was not significant.

olympicsmod2<-lm(WeightedTotal~GDP+TotalAthletes+Distance,data=olympics2024)
summary(olympicsmod2)


Call:
lm(formula = WeightedTotal ~ GDP + TotalAthletes + Distance, 
    data = olympics2024)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.431 -16.397  -6.266  12.581  48.975 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    9.531074  12.119446   0.786  0.44084    
GDP            0.008041   0.001056   7.616 2.47e-07 ***
TotalAthletes  0.128013   0.040113   3.191  0.00459 ** 
Distance       0.002538   0.001786   1.421  0.17063    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.39 on 20 degrees of freedom
Multiple R-squared:  0.8759,    Adjusted R-squared:  0.8573 
F-statistic: 47.05 on 3 and 20 DF,  p-value: 3.023e-09

round(coef(olympicsmod2),4)

  (Intercept)           GDP TotalAthletes      Distance 
       9.5311        0.0080        0.1280        0.0025

round(confint(olympicsmod2),4)

                 2.5 %  97.5 %
(Intercept)   -15.7496 34.8118
GDP             0.0058  0.0102
TotalAthletes   0.0443  0.2117
Distance       -0.0012  0.0063

$\widehat{WeightedTotal}=9.531+0.008GDP+0.128+0.0025Distance$

We see from the global F test this is significant, however Distance is not significant in the model. We will remove Distance and refit the model again.

olympicsmod3<-lm(WeightedTotal~GDP+TotalAthletes,data=olympics2024)
summary(olympicsmod3)


Call:
lm(formula = WeightedTotal ~ GDP + TotalAthletes, data = olympics2024)

Residuals:
    Min      1Q  Median      3Q     Max 
-41.077 -20.727  -1.128  16.394  45.443 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   18.369036  10.652346   1.724  0.09932 .  
GDP            0.008313   0.001063   7.817 1.19e-07 ***
TotalAthletes  0.121908   0.040840   2.985  0.00706 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.02 on 21 degrees of freedom
Multiple R-squared:  0.8633,    Adjusted R-squared:  0.8503 
F-statistic: 66.34 on 2 and 21 DF,  p-value: 8.393e-10

round(coef(olympicsmod3),4)

  (Intercept)           GDP TotalAthletes 
      18.3690        0.0083        0.1219

round(confint(olympicsmod3),4)

                2.5 %  97.5 %
(Intercept)   -3.7837 40.5218
GDP            0.0061  0.0105
TotalAthletes  0.0370  0.2068

Further Assessment:

Root MSE: 27.02 (slightly better than the model with higher order term)
Adjusted R-Sq: 0.8503 (better than model with higher order term, and it is closer to R2)
- After accounting for the number of predictors and observations, 85% of the variation in weighted medal count is explained by the model with GDP and total athletes.
Confidence Interval for Betas: We are 95% confident that for each billion dollar increase in GDP, the true the weighted medal count increases by between 0.006 and 0.0105, while the number of total athletes remains constant.

Step 6: Check the Model Assumptions

We will cover this in Unit 3

Step 7: Use the model for prediction or estimation

Back to our questions- What factors might impact a country’s success?

From our simple assessment we determined GDP and number of athletes contribute to a country’s weighted medal count. We did not determine the cash incentives contributed. Although, we could have attempted further transformations to find a strong relationship.

When can we use this model?

We should only use this model to predict for countries within the data set and for the summer Olympics. Keep in mind also that the number of events can change across each games. If we predict for countries well beyond the scope of our explanatory variables (or response), that is called extrapolation.

How do we use the model?

Let’s compare how this model performed for the US medal count (a high value) and Spain’s medal count (a middle value).

olympicsmod3<-lm(WeightedTotal~GDP+TotalAthletes,data=olympics2024)
round(coef(olympicsmod3),4)

  (Intercept)           GDP TotalAthletes 
      18.3690        0.0083        0.1219

# The long way to get y-hat
# US
18.3690+0.0083*27360.935+0.1219*593

[1] 317.7515

# Spain
18.3690+0.0083*1580.694713+0.1219*382

[1] 78.05457

NOTE: THE COEFFICIENTS HAVE BEEN ROUDNED, so there is slight error from the true estimates below. If we use this model and the statistics from the data, we would have predicted the US to have a weighed count of 317 and Spain to have a weighed count of 78.

# A much easier and exact way
# Spain is observation 20 and the US is observation 23 in the data table
# we can pull values from the model object
olympicsmod3$fitted.values

        1         2         3         4         5         6         7         8 
 88.77574  43.73900  70.20640  74.68151 201.39789  36.72361 113.29469 107.46504 
        9        10        11        12        13        14        15        16 
 85.99667  40.73704  26.46091  83.43788 103.37101  27.79535  61.30981  44.24828 
       17        18        19        20        21        22        23        24 
 51.20042  34.33086  34.55722  78.07741  37.56406  36.92241 318.09806  29.60872

olympicsmod3$fitted.values[c(20, 23)]

       20        23 
 78.07741 318.09806

# or we can use the predict function
predict(olympicsmod3)

        1         2         3         4         5         6         7         8 
 88.77574  43.73900  70.20640  74.68151 201.39789  36.72361 113.29469 107.46504 
        9        10        11        12        13        14        15        16 
 85.99667  40.73704  26.46091  83.43788 103.37101  27.79535  61.30981  44.24828 
       17        18        19        20        21        22        23        24 
 51.20042  34.33086  34.55722  78.07741  37.56406  36.92241 318.09806  29.60872

predict(olympicsmod3)[c(20, 23)]

       20        23 
 78.07741 318.09806

predict(olympicsmod3,interval="prediction")[c(20, 23),]

         fit       lwr      upr
20  78.07741  19.27739 136.8774
23 318.09806 245.08655 391.1096

# Or we can create a data frame with the new values.
# the explanatory variable names must match those in the model, but there is no response variable.
newolymp<-data.frame(GDP=c(27360.935,1580.694713),TotalAthletes=c(593,382))
newolymp

predict(olympicsmod3,newolymp, interval="prediction")

        fit       lwr      upr
1 318.09835 245.08680 391.1099
2  78.07736  19.27734 136.8774

# to compute confidence intervals for estimation we would use interval="confidence"

If we use this model and the statistics from the data table, we would have predicted the US to have a weighted medal count of 318.098 exactly. Further, we are 95% confident that the US to have a weighted medal count between 245.086 and 391.10, given their GDP of 27360.935 and 593 total athletes.
If we use this model and the statistics from the data table, we would have predicted Spain to have a weighted medal count of 78.07 exactly. Further, we are 95% confident that Spain to have a weighted medal count between 19.27 and 136.87, given their GDP of 1580.69 and 382 total athletes.
The US actually had a weighted total of 290 and Spain had an actual total of 37. How accurate was this model at predicting?
Although this model did well for prediction because our confidence interval contained the realized response, and overall, this model is good for predicting, based on the assessment metrics, we should not be used for prediction beyond these data. We should update this model with recent data if we want to continue using it and want to consider fine tuning to account for the skew in the response.

STAT 3220 Unit 1