Winning big at the Olympics brings national prestige and an increased presence on the world stage for a country. This leads to a country to increase its image and popularity, in turn could cause increases in tourism a potential bid to host the Olympics, and other strong political relationship. Therefore, it may be useful to predict how many medals a country would win. What factors might impact a country’s success?

Filipino gymnast, Carlos Yulo, made headlines as the first male to win an Olympic gold medal in the history of the Philippines. But he continued to grow in popularity as his earnings from the country caught attention. He earned a cash prize of 10 million Philippine pesos ($172,519 USD), a brand new condo valued at over $400,000 USD, a lifetime supply of ramen and other incentives. In comparison, the US awards its gold medalists a “measly” $38,000 for winning the gold. While governments in Great Britain and Sweden do not offer any direct cash incentives. In this unit we will explore data from the 2024 Paris Olympics, how cash incentives and national factors impact medal counts for each country. We will explore the countries with the top 24 medal counts.

Country: Team Name
WeightedTotal: Weighted medal count from Paris 2024 (4 points for each gold, 2 for each silver, 1 for bronze)
Total: Non-weighted total medal count for a country from Paris 2024
TGold: Total gold medal count for a country from Paris 2024
TSilver: Total gold medal count for a country from Paris 2024
TBronze: Total gold medal count for a country from Paris 2024
GoldPrize: Government cash incentive for a gold medal (in USD)
SilverPrize : Government cash incentive for a silver medal (in USD)
BronzePrize : Government cash incentive for a bronze medal (in USD)
YOGold: Total gold medal count for a country from Buenos Aires Youth Olympics in 2018
YOSilver: Total silver medal count for a country from Buenos Aires Youth Olympics in 2018
YOBronze: Total bronze medal count for a country from Buenos Aires Youth Olympics in 2018
YOTotal: Total medal count for a country from Buenos Aires Youth Olympics in 2018
GDP: country’s gross domestic product in 2023 (measured in billions of USD)
Population: country’s population in 2023 (measured in millions)
WomanLaborForce: Percentage of woman working in the country in 2023
TotalAthletes: Total number of athletes participation in Paris 2024 Olympics
Distance: Average distance from country to Paris (rounded to 100)

Step 1: Collect the Data

Data Compilation: The data were compiled from several sources including ESPN, Olympic International Committee, and World Bank. Some values were cross references on Wikipedia. We have no reason to question the credibility of these sources.
Data Organization: Are the data organized for MLR?- each row contains information about one observation (experimental unit) including all of the explanatory predictors and the response.
Suitability of response variable for regression: View a histogram of response variable. Primarily, it should be continuous with few outliers. Data should not be recorded over time. It is sometimes useful if the response variable is unimodal and symmetric, but that is NOT a requirement. (Further, we will not use “rank” as a response variable in MLR. We need a continuous variable with several values of each response, if the variable is discrete.)

# Load in the data from a csv
olympics2024<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/olympicsclean2024.csv")
# Produce the first 6 rows of the data
head(olympics2024)

# Create a histogram of the response
hist(olympics2024$WeightedTotal, xlab="Weighted Medal Count", main="Histogram of Weighted Medal Count")

ADD YOUR NOTES ON THE SHAPE + DISTRIBUTION OF RESPONSE:

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

We explore the scatter plots and correlations for each explanatory variable then classify the relationships as linear, curvilinear, other, or none.

As a class, let’s review some variables of interest together: Gold Prize, GDP, and Women Labor Force.

# Generate the names and positions of the columns in the data
names(olympics2024)

 [1] "Country"         "WeightedTotal"   "Total"           "TGold"          
 [5] "TSilver"         "TBronze"         "GoldPrize"       "SilverPrize"    
 [9] "BronzePrize"     "YOGold"          "YOSilver"        "YOBronze"       
[13] "YOTotal"         "GDP"             "Population"      "WomanLaborForce"
[17] "TotalAthletes"   "Distance"

# Create one scatter plot
# plot(x,y)
plot(olympics2024$GoldPrize, olympics2024$WeightedTotal,xlab="Gold Winning Cash Incentive",ylab="Weighted Total Medal Count")

plot(olympics2024$GDP, olympics2024$WeightedTotal,xlab="GDP",ylab="Weighted Total Medal Count")

plot(olympics2024$WomanLaborForce, olympics2024$WeightedTotal,xlab="Woman Labor Force",ylab="Weighted Total Medal Count")

# We can also make several plots at once:
# plot(y~x1+x2+x3...,data)
plot(WeightedTotal~GoldPrize+GDP+WomanLaborForce,data=olympics2024)

# We can make these plots even cleaner by plotting them within one window.
# the par() functions allows us to put plots on the same grid
# don't worry too much about syntax
# mfrow=c(ROWS, COLUMNS), pin=size of plot in inches
# mar = margins of plotting area
par(mfrow=c(1,3),pin=c(1.5,1.5),oma=c(3,3,3,3))
# Use a loop to make a lot of plots at once
for (i in names(olympics2024)[c(7,14,16)]) {
  plot(olympics2024[,i], olympics2024$WeightedTotal,xlab=i,ylab="Weighted Total Medal Count")
}
mtext(text="Weighted Total Medal Count",side=2,line=0,outer=TRUE)

# compute correlation simply for one variable with response
cor(olympics2024$GoldPrize,olympics2024$WeightedTotal)

[1] 0.3516467

cor(olympics2024$GDP,olympics2024$WeightedTotal)

[1] 0.8974229

cor(olympics2024$WomanLaborForce,olympics2024$WeightedTotal)

[1] 0.2092121

# compute correlations for all variables and round to three decimal places
round(cor(olympics2024[c(7,14,16)],olympics2024[,2]),3)

                 [,1]
GoldPrize       0.352
GDP             0.897
WomanLaborForce 0.209

We see that both Gold Prize and Woman labor Force have weak correlations (below about 0.35). In combination with the scatter plots there does not appear to be a non-linear relationship that would make that relationship stronger.

We see that GDP does have a strong correlation (almost 0.89). In the scatter plot, we see there are two countries with very extreme values. What are they?

# Extract the countries with the large GDPs
olympics2024$Country[olympics2024$GDP>15000]

[1] "China"         "United States"

These observations are pulling the SLR so strongly that we are over-fitting the linear model. How can we minimize the impact of these countries? And potentially create a more representative relationship? Transform the explanatory variable GDP!

# We can transform variables directly into the plot and cor functions without changing the variables
#log()

plot(log(olympics2024$GDP), olympics2024$WeightedTotal,xlab="Log of GDP",ylab="Weighted Total Medal Count")

cor(log(olympics2024$GDP),olympics2024$WeightedTotal)

[1] 0.8186216

#sqrt()
plot(sqrt(olympics2024$GDP), olympics2024$WeightedTotal,xlab="Square Root of GDP",ylab="Weighted Total Medal Count")

cor(sqrt(olympics2024$GDP),olympics2024$WeightedTotal)

[1] 0.9301069

# We can also transform the response
plot(log(olympics2024$GDP), log(olympics2024$WeightedTotal),xlab="Log of GDP",ylab="Log of Weighted Total Medal Count")

cor(log(olympics2024$GDP),log(olympics2024$WeightedTotal))

[1] 0.7853769

From this preliminary analysis, we see that perhaps the square root transformation of GDP will produce a linear relationship for these data.

Important notes on transformations:

If we transform the response variable to improve the relationship for one explanatory variable, we must have that transformation for every variable. Sometimes this is not worth it.
We can have some explanatory variables transformed and some left transformed.
We do not need to perform the same transformations to the explanatory variables and/or response.
Sometimes a “transformation” includes fitting a “curvilinear” term, in which case we retain the linear and higher order terms ($x$ and $x^2$)

TL;DR - Our goal through the exploratory data analysis phase is to identify the univariate relationships between the explanatory variables and the response. We do this by creating scatter plots and correlations for each variable then attempting any necessary transformations to find the most linear relationships that best fit all of the data. Then classify these relationships and select the variables with the strongest relationships. We will learn a more objective way to do this.

From these three variables- Gold Prize, GDP, and Woman Labor Force, we would write the hypothesized linear model as:

$WeightedTotal=\beta_0+\beta_1 sqrt(GDP)+\beta_2 GoldPrize+\beta_3 WomanLaborForce+\epsilon$

YOUR TURN: Go to Unit 1 Classwork.

Add your notes about the remaining 9 variables.

Variable Name	Description of relationships and transformations
GoldPrize
SilverPrize
BronzePrize
YOGold
YOSilver
YOBronze
YOTotal
GDP
Population
WomanLaborForce
TotalAthletes
Distance

# All scatter plots
par(mfrow=c(2,4),pin=c(1,1),mar=c(7.1,4.1,3,1))
# Use a loop to make a lot of plots at once
for (i in names(olympics2024)[7:18]) {
  plot((olympics2024[,i]), (olympics2024$WeightedTotal),xlab=i,ylab="Log Weighted Total Medal Count")
}

# compute correlations for all variables and round to three decimal places
round(cor((olympics2024[7:18]),(olympics2024$WeightedTotal)),3)

                 [,1]
GoldPrize       0.352
SilverPrize     0.318
BronzePrize     0.230
YOGold          0.481
YOSilver        0.479
YOBronze        0.491
YOTotal         0.547
GDP             0.897
Population      0.634
WomanLaborForce 0.209
TotalAthletes   0.682
Distance        0.217

After all EDA, we may decide to fit the following model (You likely chose different variables):

We will keep the un-transformed response because we still have variables with strong relationships.

We see sqrt(GDP) (r=0.930) and Total Athletes (r=0.682) have the strongest relationships
Potentially, distance has a curvilinear relationship.
Therefore, our hypothesized model is:

$WeightedTotal=\beta_0+\beta_1 \sqrt{GDP}+\beta_2 TotalAthletes+\beta_3 Distance+\beta_4 Distance^2+\epsilon$

Step 3: Estimate the model parameters (fit the model using R)

# store the model as an object
# We will use the naming convention DataNameMod1, etc
# lm(RESPONSE~x1+...,data=DATAFRAME)
# we can add transformations directly into the lm function
# to add a higher order term, we must use the I() syntax as below.
olympicsmod1<-lm(WeightedTotal~sqrt(GDP)+TotalAthletes+Distance +I(Distance ^2),data=olympics2024)
summary(olympicsmod1)


Call:
lm(formula = WeightedTotal ~ sqrt(GDP) + TotalAthletes + Distance + 
    I(Distance^2), data = olympics2024)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.067 -13.973   0.616  16.911  33.705 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -8.483e+00  1.362e+01  -0.623    0.541    
sqrt(GDP)      1.658e+00  2.267e-01   7.314 6.18e-07 ***
TotalAthletes  3.758e-02  4.927e-02   0.763    0.455    
Distance      -1.952e-03  5.897e-03  -0.331    0.744    
I(Distance^2)  4.817e-07  5.578e-07   0.863    0.399    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.04 on 19 degrees of freedom
Multiple R-squared:  0.8938,    Adjusted R-squared:  0.8715 
F-statistic: 39.99 on 4 and 19 DF,  p-value: 5.302e-09

# Extract just the coefficients 
olympicsmod1$coefficients

  (Intercept)     sqrt(GDP) TotalAthletes      Distance I(Distance^2) 
-8.482918e+00  1.658033e+00  3.758207e-02 -1.951541e-03  4.816828e-07

# Or use the coef function and round them 4 decimal places
round(coef(olympicsmod1),4)

  (Intercept)     sqrt(GDP) TotalAthletes      Distance I(Distance^2) 
      -8.4829        1.6580        0.0376       -0.0020        0.0000

# Compute MSE from estimated standard error
sigma(olympicsmod1)^2

[1] 627.0386

The prediction equation is:

$\widehat{WeightedTotal}=-8.4829+1.658\sqrt{GDP}+0.0376TotalAthletes-0.002Distance+.0000Distance^2$

Interpretation of coefficients:

For a billion dollar increase in the square root of GDP, the predicted weighted medal total is expected to increase by 1.658, given the other explanatory variables are held constant.
Write the interpretation of the coefficient for total athletes:
The coefficients for the linear and curvilinear components of distance are not practical to interpret individually.

Step 4: Specify the distribution of the errors and find the estimate of the variance

$\epsilon \overset{\mathrm{iid}}{\sim} N(0,\sigma^2 )$
estimate of $\sigma$ is $\sqrt{MSE}=25.04$
estimate of $\sigma^2$ is $MSE=627.0386$
Approximately 95% of our predictions will be within 2*25.04=50.08 medals of the actual weighted medal total. Whether this is “good” depends greatly on the goals of prediction here.

Step 5: Evaluate the Utility of the model

$WeightedTotal=\beta_0+\beta_1 sqrt(GDP)+\beta_2 TotalAthletes+\beta_3 Distance+\beta_4 Distance^2+\epsilon$

First we Perform the Global F Test:

Hypotheses:
- $H_0: \beta_1= \beta_2=\beta_3=\beta_4=0$ (the model is not adequate)
- $H_a$:at least one of $\beta_1 , \beta_2 , \beta_3,\beta_4$ does not equal 0 (the model is adequate)
Distribution of test statistic: F with 4, 19 DF
Test Statistic: F=39.99
Pvalue: <0.0001=0.0000000053 (very small)
Decision: 0.0001<0.1 -> REJECT H0
Conclusion: The model with sqrt(GDP), Total Athletes, Distance and Distance squared is adequate at predicting the weighted medal count.

Then we test “the most important predictors”: Test the Individual Significance of Distance^2

Hypotheses:
- $H_0: \beta_4=0$ (the quadratic relationship does not contribute to predicting the weighted medal count)
- $H_a:\beta_4 \neq 0$ (the quadratic relationship contributes to predicting the weighted medal count)
Distribution of test statistic: T with 19 DF
Test Statistic: t=0.863
Pvalue: 0.399
Decision: 0.399>0.1 -> FAIL TO REJECT H0
Conclusion: The quadratic relationship of distance does not contribute information for predicting the weighted medal count for a country. We will remove just the higher order term and refit the model.

Refit the model

Refitting the model can change significance of other variables and will always change the beta estimates. Here the estimates changed only slightly because the term we removed was not significant.

#refit the model by removing distance^2
olympicsmod2<-lm(WeightedTotal~sqrt(GDP)+TotalAthletes+Distance,data=olympics2024)
summary(olympicsmod2)


Call:
lm(formula = WeightedTotal ~ sqrt(GDP) + TotalAthletes + Distance, 
    data = olympics2024)

Residuals:
    Min      1Q  Median      3Q     Max 
-53.247 -10.839   0.496  17.119  34.575 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -15.315035  11.018564  -1.390   0.1798    
sqrt(GDP)       1.550783   0.188425   8.230  7.5e-08 ***
TotalAthletes   0.058329   0.042741   1.365   0.1875    
Distance        0.002928   0.001675   1.748   0.0958 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.88 on 20 degrees of freedom
Multiple R-squared:  0.8897,    Adjusted R-squared:  0.8731 
F-statistic: 53.75 on 3 and 20 DF,  p-value: 9.388e-10

round(coef(olympicsmod2),4)

  (Intercept)     sqrt(GDP) TotalAthletes      Distance 
     -15.3150        1.5508        0.0583        0.0029

$\widehat{WeightedTotal}=-15.3150+1.5508\sqrt{GDP}+0.0583TotalAthletes+0.0029Distance$

We see from the global F test this is significant. Although Total Athletes is not significant individually, we will keep it in the model.

# confidence intervals for betas
# confint(model, level=.95)
round(confint(olympicsmod2,level=.9),4)

                   5 %   95 %
(Intercept)   -34.3190 3.6889
sqrt(GDP)       1.2258 1.8758
TotalAthletes  -0.0154 0.1320
Distance        0.0000 0.0058

Further Assessment:

Root MSE: 24.88 (slightly better than the model with higher order term)
Adjusted R-Sq: 0.8731 (better than model with higher order term, and it is closer to R2)
- After accounting for the number of predictors and observations, 87.3% of the variation in weighted medal count is explained by the model with square root of GDP, total athletes, and distance.
Confidence Interval for Betas: We are 90% confident that for each billion dollar increase in the square root of GDP, the true the weighted medal count increases by between 1.2258 and 1.8758, while the number of total athletes and distance remains constant.
- Add your interpretation for total athletes.
- Add your interpretation for distance.
- What is the relationship between the results from the t-tests and the confidence intervals?

Step 6: Check the Model Assumptions

We will cover this in Unit 3

Step 7: Use the model for prediction or estimation

Prediction

Prediction is used when we want to predict a single observation from its set of predictors. We are using prediction here because we are making a guess about the outcome of a single country.

Back to our questions- What factors might impact a country’s success?

From our simple assessment we determined square root of GDP, number of athletes, and distance to Paris contribute to a country’s weighted medal count. We did not determine the cash incentives contributed. Although, we could have attempted further transformations to find a strong relationship.

When can we use this model?

We should only use this model to predict for countries within the data set and for the summer Olympics. Keep in mind also that the number of events can change across each games. If we predict for countries well beyond the scope of our explanatory variables (or response), that is called extrapolation.

How do we use the model?

Let’s compare how this model performed for the US medal count (a high value) and Spain’s medal count (a middle value).

# refitting the final model 
olympicsmod2<-lm(WeightedTotal~sqrt(GDP)+TotalAthletes+Distance,data=olympics2024)
round(coef(olympicsmod2),4)

  (Intercept)     sqrt(GDP) TotalAthletes      Distance 
     -15.3150        1.5508        0.0583        0.0029

coef(olympicsmod2)

  (Intercept)     sqrt(GDP) TotalAthletes      Distance 
 -15.31503494    1.55078273    0.05832907    0.00292784

# The long way to get y-hat

# Spain
-15.3150+1.5508*sqrt(1580.694713)+0.0583*382+0.0029*800

[1] 70.93223

# US
-15.3150+1.5508*sqrt(27360.935)+0.0583*593+0.0029*5000

[1] 290.2769

NOTE: THE COEFFICIENTS HAVE BEEN ROUDNED, so there is slight error from the true estimates below. If we use this model and the statistics from the data, we would have predicted the US to have a weighed count of 290.27 and Spain to have a weighed count of 70.93.

# A much easier and exact way
# Spain is observation 20 and the US is observation 23 in the data table
# we can pull values from the model object
olympicsmod2$fitted.values

        1         2         3         4         5         6         7         8 
103.42447  33.88707  89.24731  87.15497 222.70015  25.37977 103.42538 114.57639 
        9        10        11        12        13        14        15        16 
 95.43203  20.07145  28.57378  83.41799 126.82335  23.57755  53.81000  55.29872 
       17        18        19        20        21        22        23        24 
 44.26396  24.07906  66.48759  70.96499  33.08924  18.27204 290.43030  15.61245

olympicsmod2$fitted.values[c(20, 23)]

       20        23 
 70.96499 290.43030

# or we can use the predict function
predict(olympicsmod2)

        1         2         3         4         5         6         7         8 
103.42447  33.88707  89.24731  87.15497 222.70015  25.37977 103.42538 114.57639 
        9        10        11        12        13        14        15        16 
 95.43203  20.07145  28.57378  83.41799 126.82335  23.57755  53.81000  55.29872 
       17        18        19        20        21        22        23        24 
 44.26396  24.07906  66.48759  70.96499  33.08924  18.27204 290.43030  15.61245

predict(olympicsmod2)[c(20, 23)]

       20        23 
 70.96499 290.43030

# Compute Interval Estimates
predict(olympicsmod2,interval="prediction",level=.9)[c(20, 23),]

         fit       lwr     upr
20  70.96499  25.66594 116.264
23 290.43030 237.65258 343.208

# Or we can create a data frame with the new values.
# the explanatory variable names must match those in the model, but there is no response variable.
newolymp<-data.frame(GDP=c(1580.694713,27360.935),TotalAthletes=c(382,593),Distance=c(800,5000))
newolymp

predict(object=olympicsmod2,newdata=newolymp, interval="prediction",level=.9)

        fit       lwr      upr
1  70.96489  25.66583 116.2639
2 290.43046 237.65273 343.2082

If we use this model and the statistics from the data table, we would have predicted Spain to have a weighted medal count of 70.96 exactly. Further, we are 90% confident that Spain to have a weighted medal count between 25.66594 and 116.264, given their GDP of 1580.69, 382 total athletes, and being 800 miles from Paris.
If we use this model and the statistics from the data table, we would have predicted the US to have a weighted medal count of 290.43 exactly. Further, we are 90% confident that the US to have a weighted medal count between 237.65258 and 343.208, given their GDP of 27360.935, 593 total athletes, and being 5000 miles from Paris.
The US actually had a weighted total of 290 and Spain had an actual total of 37. How accurate was this model at predicting?
Although this model did well for prediction because our confidence interval contained the realized response, and overall, this model is good for predicting, based on the assessment metrics, we should not be used for prediction beyond these data. We should update this model with recent data if we want to continue using it and want to consider fine tuning to account for the skew in the response.

Estimation

Estimation is used when we want to make a guess about the average of several observations that all have the same set of predictors.

#estimation
# to compute confidence intervals for estimation we would use interval="confidence"
predict(olympicsmod2,interval="confidence",level=.9)[c(20, 23),]

         fit       lwr       upr
20  70.96499  56.45606  85.47392
23 290.43030 259.70584 321.15476

We are 90% confident that the countries with identical metrics as Spain (GDP of 1580.69, 382 total athletes, and being 800 miles from Paris) will have a average of a weighted medal count between 56.45606 and 85.47392.
We are 90% confident that the countries with identical metrics as the US (GDP of 27360.935, 593 total athletes, and being 5000 miles from Paris) will have a average of a weighted medal count between 259.70584 and 321.15476.

STAT 3220 Unit 1