Winning big at the Olympics brings national prestige and an increased presence on the world stage for a country. This leads to a country to increase its image and popularity, in turn could cause increases in tourism a potential bid to host the Olympics, and other strong political relationship. Therefore, it may be useful to predict how many medals a country would win. What factors might impact a country’s success?
Filipino gymnast, Carlos Yulo, made headlines as the first male to win an Olympic gold medal in the history of the Philippines. But he continued to grow in popularity as his earnings from the country caught attention. He earned a cash prize of 10 million Philippine pesos ($172,519 USD), a brand new condo valued at over $400,000 USD, a lifetime supply of ramen and other incentives. In comparison, the US awards its gold medalists a “measly” $38,000 for winning the gold. While governments in Great Britain and Sweden do not offer any direct cash incentives. In this unit we will explore data from the 2024 Paris Olympics, how cash incentives and national factors impact medal counts for each country. We will explore the countries with the top 24 medal counts.
# Load in the data from a csv
olympics2024<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/olympics2024SP26.csv")
# Produce the first 6 rows of the data
head(olympics2024)
# Create a histogram of the response
hist(olympics2024$WeightedTotal, xlab="Weighted Medal Count", main="Histogram of Weighted Medal Count")
ADD YOUR NOTES ON THE SHAPE + DISTRIBUTION OF RESPONSE:
We explore the scatter plots and correlations for each explanatory variable then classify the relationships as linear, curvilinear, other, or none.
As a class, let’s review some variables of interest together: Gold Prize, GDP, and Women Labor Force.
# Generate the names and positions of the columns in the data
names(olympics2024)
[1] "Country" "WeightedTotal" "Total" "TGold"
[5] "TSilver" "TBronze" "GoldPrize" "YOTotal"
[9] "GDP" "Population" "WomanLaborForce" "TotalAthletes"
[13] "Distance"
# Create one scatter plot
# plot(x,y)
plot(olympics2024$GoldPrize, olympics2024$WeightedTotal,xlab="Gold Winning Cash Incentive",ylab="Weighted Total Medal Count")
plot(olympics2024$GDP, olympics2024$WeightedTotal,xlab="GDP",ylab="Weighted Total Medal Count")
plot(olympics2024$WomanLaborForce, olympics2024$WeightedTotal,xlab="Woman Labor Force",ylab="Weighted Total Medal Count")
# We can also make several plots at once:
# plot(y~x1+x2+x3...,data)
plot(WeightedTotal~GoldPrize+GDP+WomanLaborForce,data=olympics2024)
See the markdown file for the coding syntax.
# compute correlation simply for one variable with response
cor(olympics2024$GoldPrize,olympics2024$WeightedTotal)
[1] 0.3516467
cor(olympics2024$GDP,olympics2024$WeightedTotal)
[1] 0.8974232
cor(olympics2024$WomanLaborForce,olympics2024$WeightedTotal)
[1] 0.2092479
# compute correlations for all variables and round to three decimal places
round(cor(olympics2024[c(7,9,11)],olympics2024[,2]),3)
[,1]
GoldPrize 0.352
GDP 0.897
WomanLaborForce 0.209
We see that both Gold Prize and Woman labor Force have weak correlations (below about 0.35). In combination with the scatter plots there does not appear to be a non-linear relationship that would make that relationship stronger.
We see that GDP does have a strong correlation (almost 0.89). In the scatter plot, we see there are two countries with very extreme values. What are they?
# Extract the countries with the large GDPs
olympics2024$Country[olympics2024$GDP>15000]
[1] "China" "United States"
These observations are pulling the SLR so strongly that we are over-fitting the linear model. How can we minimize the impact of these countries? And potentially create a more representative relationship? Transform the explanatory variable GDP!
# We can transform variables directly into the plot and cor functions without changing the variables
#log()
plot(log(olympics2024$GDP), olympics2024$WeightedTotal,xlab="Log of GDP",ylab="Weighted Total Medal Count")
cor(log(olympics2024$GDP),olympics2024$WeightedTotal)
[1] 0.8186207
#sqrt()
plot(sqrt(olympics2024$GDP), olympics2024$WeightedTotal,xlab="Square Root of GDP",ylab="Weighted Total Medal Count")
cor(sqrt(olympics2024$GDP),olympics2024$WeightedTotal)
[1] 0.930107
# We can also transform the response
plot(log(olympics2024$GDP), log(olympics2024$WeightedTotal),xlab="Log of GDP",ylab="Log of Weighted Total Medal Count")
cor(log(olympics2024$GDP),log(olympics2024$WeightedTotal))
[1] 0.7853723
From this preliminary analysis, we see that perhaps the square root transformation of GDP will produce a linear relationship for these data.
Important notes on transformations:
TL;DR - Our goal through the exploratory data analysis phase is to identify the univariate relationships between the explanatory variables and the response. We do this by creating scatter plots and correlations for each variable then attempting any necessary transformations to find the most linear relationships that best fit all of the data. Then classify these relationships and select the variables with the strongest relationships. We will learn a more objective way to do this.
From these three variables- Gold Prize, GDP, and Woman Labor Force, we would write the hypothesized linear model as:
\(WeightedTotal=\beta_0+\beta_1 \sqrt{GDP}+\beta_2 GoldPrize+\beta_3 WomanLaborForce+\epsilon\)
YOUR TURN
Add your notes about the remaining 9 variables.
| Variable Name | Description of relationships and transformations |
|---|---|
| GoldPrize | |
| YOTotal | |
| GDP | |
| Population | |
| WomanLaborForce | |
| TotalAthletes | |
| Distance |
# All scatter plots
par(mfrow=c(2,2),pin=c(1,1),mar=c(7.1,4.1,3,1))
# Use a loop to make a lot of plots at once
for (i in names(olympics2024)[7:13]) {
plot((olympics2024[,i]), (olympics2024$WeightedTotal),xlab=i,ylab="Log Weighted Total Medal Count")
}
# compute correlations for all variables and round to three decimal places
round(cor((olympics2024[7:13]),(olympics2024$WeightedTotal)),3)
[,1]
GoldPrize 0.352
YOTotal 0.547
GDP 0.897
Population 0.634
WomanLaborForce 0.209
TotalAthletes 0.682
Distance 0.217
After all EDA, we may decide to fit the following model (You likely chose different variables):
We will keep the un-transformed response because we still have variables with strong relationships.
\(WeightedTotal=\beta_0+\beta_1 \sqrt{GDP}+\beta_2 TotalAthletes+\beta_3 Distance+\beta_4 Distance^2+\epsilon\)
# store the model as an object
# We will use the naming convention DataNameMod1, etc
# lm(RESPONSE~x1+...,data=DATAFRAME)
# we can add transformations directly into the lm function
# to add a higher order term, we must use the I() syntax as below.
olympicsmod1<-lm(WeightedTotal~sqrt(GDP)+TotalAthletes+Distance +I(Distance ^2),data=olympics2024)
summary(olympicsmod1)
Call:
lm(formula = WeightedTotal ~ sqrt(GDP) + TotalAthletes + Distance +
I(Distance^2), data = olympics2024)
Residuals:
Min 1Q Median 3Q Max
-47.066 -13.973 0.616 16.912 33.705
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.482e+00 1.362e+01 -0.623 0.541
sqrt(GDP) 1.658e+00 2.267e-01 7.314 6.18e-07 ***
TotalAthletes 3.758e-02 4.927e-02 0.763 0.455
Distance -1.952e-03 5.897e-03 -0.331 0.744
I(Distance^2) 4.817e-07 5.578e-07 0.864 0.399
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.04 on 19 degrees of freedom
Multiple R-squared: 0.8938, Adjusted R-squared: 0.8715
F-statistic: 39.99 on 4 and 19 DF, p-value: 5.302e-09
# Extract just the coefficients
olympicsmod1$coefficients
(Intercept) sqrt(GDP) TotalAthletes Distance I(Distance^2)
-8.482254e+00 1.658039e+00 3.758050e-02 -1.951855e-03 4.817133e-07
# Or use the coef function and round them 4 decimal places
round(coef(olympicsmod1),4)
(Intercept) sqrt(GDP) TotalAthletes Distance I(Distance^2)
-8.4823 1.6580 0.0376 -0.0020 0.0000
# Compute MSE from estimated standard error
sigma(olympicsmod1)^2
[1] 627.0353
The prediction equation is:
\(\widehat{WeightedTotal}=-8.4829+1.658\sqrt{GDP}+0.0376TotalAthletes-0.002Distance+.0000Distance^2\)
\(WeightedTotal=\beta_0+\beta_1 sqrt(GDP)+\beta_2 TotalAthletes+\beta_3 Distance+\beta_4 Distance^2+\epsilon\)
Refitting the model can change significance of other variables and will always change the beta estimates. Here the estimates changed only slightly because the term we removed was not significant.
#refit the model by removing distance^2
olympicsmod2<-lm(WeightedTotal~sqrt(GDP)+TotalAthletes+Distance,data=olympics2024)
summary(olympicsmod2)
Call:
lm(formula = WeightedTotal ~ sqrt(GDP) + TotalAthletes + Distance,
data = olympics2024)
Residuals:
Min 1Q Median 3Q Max
-53.247 -10.840 0.494 17.119 34.575
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.314818 11.018560 -1.390 0.1798
sqrt(GDP) 1.550781 0.188425 8.230 7.5e-08 ***
TotalAthletes 0.058329 0.042741 1.365 0.1875
Distance 0.002928 0.001675 1.748 0.0958 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 24.88 on 20 degrees of freedom
Multiple R-squared: 0.8897, Adjusted R-squared: 0.8731
F-statistic: 53.75 on 3 and 20 DF, p-value: 9.388e-10
round(coef(olympicsmod2),4)
(Intercept) sqrt(GDP) TotalAthletes Distance
-15.3148 1.5508 0.0583 0.0029
\(\widehat{WeightedTotal}=-15.3150+1.5508\sqrt{GDP}+0.0583TotalAthletes+0.0029Distance\)
We see from the global F test this is significant. Although Total Athletes is not significant individually, we will keep it in the model.
# confidence intervals for betas
# confint(model, level=.95)
round(confint(olympicsmod2,level=.9),4)
5 % 95 %
(Intercept) -34.3187 3.6891
sqrt(GDP) 1.2258 1.8758
TotalAthletes -0.0154 0.1320
Distance 0.0000 0.0058
We will cover this in Unit 3
Back to our questions- What factors might impact a country’s success?
From our simple assessment we determined square root of GDP, number of athletes, and distance to Paris contribute to a country’s weighted medal count. We did not determine the cash incentives contributed. Although, we could have attempted further transformations to find a strong relationship.
When can we use this model?
We should only use this model to predict for countries within the data set and for the summer Olympics. Keep in mind also that the number of events can change across each games. If we predict for countries well beyond the scope of our explanatory variables (or response), that is called extrapolation.
How do we use the model?
Let’s compare how this model performed for the US medal count (a high value) and Spain’s medal count (a middle value).
Prediction is used when we want to predict a single observation from its set of predictors. We are using prediction here because we are making a guess about the outcome of a single country.
# refitting the final model
olympicsmod2<-lm(WeightedTotal~sqrt(GDP)+TotalAthletes+Distance,data=olympics2024)
round(coef(olympicsmod2),4)
(Intercept) sqrt(GDP) TotalAthletes Distance
-15.3148 1.5508 0.0583 0.0029
# The long way to get y-hat
# Using R as a calculator
# Spain
-15.3148+1.5508*sqrt(1580.694713)+0.0583*382+0.0029*800
[1] 70.93243
# US
-15.3148+1.5508*sqrt(27360.935)+0.0583*593+0.0029*5000
[1] 290.2771
# Spain is observation 20 and the US is observation 23 in the data table
predict(olympicsmod2)
1 2 3 4 5 6 7 8
103.42503 33.88773 89.24682 87.15481 222.69995 25.37989 103.42548 114.57620
9 10 11 12 13 14 15 16
95.43250 20.07102 28.57394 83.41727 126.82392 23.58076 53.81069 55.29717
17 18 19 20 21 22 23 24
44.26488 24.07933 66.48752 70.96498 33.08836 18.26972 290.43027 15.61174
predict(olympicsmod2)[c(20, 23)]
20 23
70.96498 290.43027
# Compute Interval Estimates
predict(olympicsmod2,interval="prediction",level=.9)[c(20, 23),]
fit lwr upr
20 70.96498 25.66594 116.264
23 290.43027 237.65258 343.208
# First create a data frame with the new values.
# the explanatory variable names must match those in the model, but there is no response variable.
newolymp<-data.frame(GDP=c(1580.694713,27360.935),TotalAthletes=c(382,593),Distance=c(800,5000))
newolymp
predict(object=olympicsmod2,newdata=newolymp, interval="prediction",level=.9)
fit lwr upr
1 70.96498 25.66594 116.264
2 290.43027 237.65258 343.208
If we use this model and the statistics from the data table, we would have predicted Spain to have a weighted medal count of 70.96 exactly. Further, we are 90% confident that Spain to have a weighted medal count between 25.66594 and 116.264, given their GDP of 1580.69, 382 total athletes, and being 800 miles from Paris.
If we use this model and the statistics from the data table, we would have predicted the US to have a weighted medal count of 290.43 exactly. Further, we are 90% confident that the US to have a weighted medal count between 237.65258 and 343.208, given their GDP of 27360.935, 593 total athletes, and being 5000 miles from Paris.
The US actually had a weighted total of 290 and Spain had an actual total of 37. How accurate was this model at predicting?
Although this model did well for prediction because our confidence interval contained the realized response, and overall, this model is good for predicting, based on the assessment metrics, we should not be used for prediction beyond these data. We should update this model with recent data if we want to continue using it and want to consider fine tuning to account for the skew in the response.
Estimation is used when we want to make a guess about the average of several observations that all have the same set of predictors.
#estimation
# to compute confidence intervals for estimation we would use interval="confidence"
predict(olympicsmod2,interval="confidence",level=.9)[c(20, 23),]
fit lwr upr
20 70.96498 56.45605 85.47391
23 290.43027 259.70583 321.15472
We are 90% confident that the countries with identical metrics as Spain (GDP of 1580.69, 382 total athletes, and being 800 miles from Paris) will have a average of a weighted medal count between 56.45606 and 85.47392.
We are 90% confident that the countries with identical metrics as the US (GDP of 27360.935, 593 total athletes, and being 5000 miles from Paris) will have a average of a weighted medal count between 259.70584 and 321.15476.