Winning big at the Olympics brings national prestige and an increased presence on the world stage for a country. This leads to a country to increase its image and popularity, in turn could cause increases in tourism a potential bid to host the Olympics, and other strong political relationship. Therefore, it may be useful to predict how many medals a country would win. What factors might impact a country’s success?
Filipino gymnast, Carlos Yulo, made headlines as the first male to win an Olympic gold medal in the history of the Philippines. But he continued to grow in popularity as his earnings from the country caught attention. He earned a cash prize of 10 million Philippine pesos ($172,519 USD), a brand new condo valued at over $400,000 USD, a lifetime supply of ramen and other incentives. In comparison, the US awards its gold medalists a “measly” $38,000 for winning the gold. While governments in Great Britain and Sweden do not offer any direct cash incentives. In this unit we will explore data from the 2024 Paris Olympics, how cash incentives and national factors impact medal counts for each country. We will explore the countries with the top 24 medal counts.
# Load in the data from a csv
olympics2024<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/olympicsclean2024.csv")
# Produce the first 6 rows of the data
head(olympics2024)
# Create a histogram of the response
hist(olympics2024$WeightedTotal, xlab="Weighted Medal Count", main="Histogram of Weighted Medal Count")
# We can transform the response to reduce the skew
hist(log(olympics2024$WeightedTotal), xlab="Log of Weighted Medal Count", main="Histogram of log(Weighted Medal Count)")
ADD YOUR NOTES ON THIS SHAPE:
We explore the scatter plots and correlations for each explanatory variable then classify the relationships as linear, curvilinear, or none.
# Generate the names and positions of the columns in the data
names(olympics2024)
[1] "Country" "WeightedTotal" "Total" "TGold"
[5] "TSilver" "TBronze" "GoldPrize" "SilverPrize"
[9] "BronzePrize" "YOGold" "YOSilver" "YOBronze"
[13] "YOTotal" "GDP" "Population" "WomanLaborForce"
[17] "TotalAthletes" "Distance"
# Create one scatter plot
# plot(x,y)
plot(olympics2024$GoldPrize, olympics2024$WeightedTotal,xlab="Gold Winning Cash Incentive",ylab="Weighted Total Medal Count")
plot(log(olympics2024$GoldPrize), log(olympics2024$WeightedTotal),xlab="Gold Winning Cash Incentive",ylab="Weighted Total Medal Count")
# Create several plots at once
# plot(y~x+x+...,data=DATATABLENAME)
plot(WeightedTotal~log(GoldPrize)+SilverPrize,data=olympics2024)
# Create your plots in a grid
# the par() functions allows us to put plots on the same grid
# don't worry too much about syntax
# mfrow=c(ROWS, COLUMNS), pin=size of plot in inches
# mar = margins of plotting area
par(mfrow=c(2,4),pin=c(1,1),mar=c(7.1,4.1,3,1))
# Use a loop to make a lot of plots at once
for (i in names(olympics2024)[7:18]) {
plot(olympics2024[,i], olympics2024$WeightedTotal,xlab=i,ylab="Weighted Total Medal Count")
}
# compute correlation simply for one variable with response
cor(olympics2024$GoldPrize,olympics2024$WeightedTotal)
[1] 0.3516467
# compute correlations for all variables and round to three decimal places
round(cor(olympics2024[7:18],olympics2024$WeightedTotal),3)
[,1]
GoldPrize 0.352
SilverPrize 0.318
BronzePrize 0.230
YOGold 0.481
YOSilver 0.479
YOBronze 0.491
YOTotal 0.547
GDP 0.897
Population 0.634
WomanLaborForce 0.209
TotalAthletes 0.682
Distance 0.217
ADD YOUR NOTES ON THE RELATIONSHIPS WITH UNTRANSFORMED RESPONSE:
# Should we use log(weightedtotal)?
par(mfrow=c(2,4),pin=c(1,1),mar=c(7.1,4.1,3,1))
# Use a loop to make a lot of plots at once
for (i in names(olympics2024)[7:18]) {
plot(olympics2024[,i], log(olympics2024$WeightedTotal),xlab=i,ylab="Log Weighted Total Medal Count")
}
# compute correlations for all variables and round to three decimal places
round(cor(olympics2024[7:18],log(olympics2024$WeightedTotal)),3)
[,1]
GoldPrize 0.260
SilverPrize 0.220
BronzePrize 0.139
YOGold 0.460
YOSilver 0.572
YOBronze 0.586
YOTotal 0.601
GDP 0.687
Population 0.488
WomanLaborForce 0.236
TotalAthletes 0.719
Distance 0.263
ADD YOUR NOTES ON THE RELATIONSHIPS WITH UNTRANSFORMED RESPONSE:
Should we go with transformed or un-transformed?
We will keep the un-transformed response because we still have variables with strong relationships.
\(WeightedTotal=\beta_0+\beta_1 GDP+\beta_2 TotalAthletes+\beta_3 Distance+\beta_4 Distance^2+\epsilon\)
# store the model as an object
# lm(RESPONSE~x1+...,data=DATAFRAME)
olympicsmod1<-lm(WeightedTotal~GDP+TotalAthletes+Distance +I(Distance ^2),data=olympics2024)
summary(olympicsmod1)
Call:
lm(formula = WeightedTotal ~ GDP + TotalAthletes + Distance +
I(Distance^2), data = olympics2024)
Residuals:
Min 1Q Median 3Q Max
-39.003 -15.854 -6.391 12.713 48.558
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.125e+01 1.599e+01 0.704 0.4902
GDP 8.145e-03 1.240e-03 6.566 2.76e-06 ***
TotalAthletes 1.250e-01 4.483e-02 2.787 0.0117 *
Distance 1.535e-03 6.145e-03 0.250 0.8055
I(Distance^2) 9.884e-08 5.777e-07 0.171 0.8660
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 27.05 on 19 degrees of freedom
Multiple R-squared: 0.8761, Adjusted R-squared: 0.85
F-statistic: 33.58 on 4 and 19 DF, p-value: 2.262e-08
# Extract just the coefficients and round them 4 decimal places
round(coef(olympicsmod1),4)
(Intercept) GDP TotalAthletes Distance I(Distance^2)
11.2538 0.0081 0.1250 0.0015 0.0000
# Compute MSE from estimated standard error
sigma(olympicsmod1)^2
[1] 731.8691
The prediction equation is:
\(\widehat{WeightedTotal}=11.25+0.0081GDP+0.125TotalAthletes+0.0015Distance+.0000Distance^2\)
Interpretation of coefficients:
\(WeightedTotal=\beta_0+\beta_1 GDP+\beta_2 TotalAthletes+\beta_3 Distance+\beta_4 Distance^2+\epsilon\)
Refitting the model can change significance of other variables and will always change the beta estimates. Here the estimates changed only slightly because the term we removed was not significant.
olympicsmod2<-lm(WeightedTotal~GDP+TotalAthletes+Distance,data=olympics2024)
summary(olympicsmod2)
Call:
lm(formula = WeightedTotal ~ GDP + TotalAthletes + Distance,
data = olympics2024)
Residuals:
Min 1Q Median 3Q Max
-40.431 -16.397 -6.266 12.581 48.975
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.531074 12.119446 0.786 0.44084
GDP 0.008041 0.001056 7.616 2.47e-07 ***
TotalAthletes 0.128013 0.040113 3.191 0.00459 **
Distance 0.002538 0.001786 1.421 0.17063
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 26.39 on 20 degrees of freedom
Multiple R-squared: 0.8759, Adjusted R-squared: 0.8573
F-statistic: 47.05 on 3 and 20 DF, p-value: 3.023e-09
round(coef(olympicsmod2),4)
(Intercept) GDP TotalAthletes Distance
9.5311 0.0080 0.1280 0.0025
round(confint(olympicsmod2),4)
2.5 % 97.5 %
(Intercept) -15.7496 34.8118
GDP 0.0058 0.0102
TotalAthletes 0.0443 0.2117
Distance -0.0012 0.0063
\(\widehat{WeightedTotal}=9.531+0.008GDP+0.128+0.0025Distance\)
We see from the global F test this is significant, however Distance is not significant in the model. We will remove Distance and refit the model again.
olympicsmod3<-lm(WeightedTotal~GDP+TotalAthletes,data=olympics2024)
summary(olympicsmod3)
Call:
lm(formula = WeightedTotal ~ GDP + TotalAthletes, data = olympics2024)
Residuals:
Min 1Q Median 3Q Max
-41.077 -20.727 -1.128 16.394 45.443
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.369036 10.652346 1.724 0.09932 .
GDP 0.008313 0.001063 7.817 1.19e-07 ***
TotalAthletes 0.121908 0.040840 2.985 0.00706 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 27.02 on 21 degrees of freedom
Multiple R-squared: 0.8633, Adjusted R-squared: 0.8503
F-statistic: 66.34 on 2 and 21 DF, p-value: 8.393e-10
round(coef(olympicsmod3),4)
(Intercept) GDP TotalAthletes
18.3690 0.0083 0.1219
round(confint(olympicsmod3),4)
2.5 % 97.5 %
(Intercept) -3.7837 40.5218
GDP 0.0061 0.0105
TotalAthletes 0.0370 0.2068
We will cover this in Unit 3
Back to our questions- What factors might impact a country’s success?
From our simple assessment we determined GDP and number of athletes contribute to a country’s weighted medal count. We did not determine the cash incentives contributed. Although, we could have attempted further transformations to find a strong relationship.
When can we use this model?
We should only use this model to predict for countries within the data set and for the summer Olympics. Keep in mind also that the number of events can change across each games. If we predict for countries well beyond the scope of our explanatory variables (or response), that is called extrapolation.
How do we use the model?
Let’s compare how this model performed for the US medal count (a high value) and Spain’s medal count (a middle value).
olympicsmod3<-lm(WeightedTotal~GDP+TotalAthletes,data=olympics2024)
round(coef(olympicsmod3),4)
(Intercept) GDP TotalAthletes
18.3690 0.0083 0.1219
# The long way to get y-hat
# US
18.3690+0.0083*27360.935+0.1219*593
[1] 317.7515
# Spain
18.3690+0.0083*1580.694713+0.1219*382
[1] 78.05457
# A much easier and exact way
# Spain is observation 20 and the US is observation 23 in the data table
# we can pull values from the model object
olympicsmod3$fitted.values
1 2 3 4 5 6 7 8
88.77574 43.73900 70.20640 74.68151 201.39789 36.72361 113.29469 107.46504
9 10 11 12 13 14 15 16
85.99667 40.73704 26.46091 83.43788 103.37101 27.79535 61.30981 44.24828
17 18 19 20 21 22 23 24
51.20042 34.33086 34.55722 78.07741 37.56406 36.92241 318.09806 29.60872
olympicsmod3$fitted.values[c(20, 23)]
20 23
78.07741 318.09806
# or we can use the predict function
predict(olympicsmod3)
1 2 3 4 5 6 7 8
88.77574 43.73900 70.20640 74.68151 201.39789 36.72361 113.29469 107.46504
9 10 11 12 13 14 15 16
85.99667 40.73704 26.46091 83.43788 103.37101 27.79535 61.30981 44.24828
17 18 19 20 21 22 23 24
51.20042 34.33086 34.55722 78.07741 37.56406 36.92241 318.09806 29.60872
predict(olympicsmod3)[c(20, 23)]
20 23
78.07741 318.09806
predict(olympicsmod3,interval="prediction")[c(20, 23),]
fit lwr upr
20 78.07741 19.27739 136.8774
23 318.09806 245.08655 391.1096
# Or we can create a data frame with the new values.
# the explanatory variable names must match those in the model, but there is no response variable.
newolymp<-data.frame(GDP=c(27360.935,1580.694713),TotalAthletes=c(593,382))
newolymp
predict(olympicsmod3,newolymp, interval="prediction")
fit lwr upr
1 318.09835 245.08680 391.1099
2 78.07736 19.27734 136.8774
# to compute confidence intervals for estimation we would use interval="confidence"
If we use this model and the statistics from the data table, we would have predicted the US to have a weighted medal count of 318.098 exactly. Further, we are 95% confident that the US to have a weighted medal count between 245.086 and 391.10, given their GDP of 27360.935 and 593 total athletes.
If we use this model and the statistics from the data table, we would have predicted Spain to have a weighted medal count of 78.07 exactly. Further, we are 95% confident that Spain to have a weighted medal count between 19.27 and 136.87, given their GDP of 1580.69 and 382 total athletes.
The US actually had a weighted total of 290 and Spain had an actual total of 37. How accurate was this model at predicting?
Although this model did well for prediction because our confidence interval contained the realized response, and overall, this model is good for predicting, based on the assessment metrics, we should not be used for prediction beyond these data. We should update this model with recent data if we want to continue using it and want to consider fine tuning to account for the skew in the response.