Winning big at the Olympics brings national prestige and an increased presence on the world stage for a country. This leads to a country to increase its image and popularity, in turn could cause increases in tourism a potential bid to host the Olympics, and other strong political relationship. Therefore, it may be useful to predict how many medals a country would win. What factors might impact a country’s success?
Filipino gymnast, Carlos Yulo, made headlines as the first male to win an Olympic gold medal in the history of the Philippines. But he continued to grow in popularity as his earnings from the country caught attention. He earned a cash prize of 10 million Philippine pesos ($172,519 USD), a brand new condo valued at over $400,000 USD, a lifetime supply of ramen and other incentives. In comparison, the US awards its gold medalists a “measly” $38,000 for winning the gold. While governments in Great Britain and Sweden do not offer any direct cash incentives. In this unit we will explore data from the 2024 Paris Olympics, how cash incentives and national factors impact medal counts for each country. We will explore the countries with the top 24 medal counts.
# Load in the data from a csv
olympics2024<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/olympicsclean2024.csv")
# Produce the first 6 rows of the data
head(olympics2024)
# Create a histogram of the response
hist(olympics2024$WeightedTotal, xlab="Weighted Medal Count", main="Histogram of Weighted Medal Count")
ADD YOUR NOTES ON THE SHAPE + DISTRIBUTION OF RESPONSE:
We explore the scatter plots and correlations for each explanatory variable then classify the relationships as linear, curvilinear, other, or none.
As a class, let’s review some variables of interest together: Gold Prize, GDP, and Women Labor Force.
# Generate the names and positions of the columns in the data
names(olympics2024)
[1] "Country" "WeightedTotal" "Total" "TGold"
[5] "TSilver" "TBronze" "GoldPrize" "SilverPrize"
[9] "BronzePrize" "YOGold" "YOSilver" "YOBronze"
[13] "YOTotal" "GDP" "Population" "WomanLaborForce"
[17] "TotalAthletes" "Distance"
# Create one scatter plot
# plot(x,y)
plot(olympics2024$GoldPrize, olympics2024$WeightedTotal,xlab="Gold Winning Cash Incentive",ylab="Weighted Total Medal Count")
plot(olympics2024$GDP, olympics2024$WeightedTotal,xlab="GDP",ylab="Weighted Total Medal Count")
plot(olympics2024$WomanLaborForce, olympics2024$WeightedTotal,xlab="Woman Labor Force",ylab="Weighted Total Medal Count")
# We can also make several plots at once:
# plot(y~x1+x2+x3...,data)
plot(WeightedTotal~GoldPrize+GDP+WomanLaborForce,data=olympics2024)
# We can make these plots even cleaner by plotting them within one window.
# the par() functions allows us to put plots on the same grid
# don't worry too much about syntax
# mfrow=c(ROWS, COLUMNS), pin=size of plot in inches
# mar = margins of plotting area
par(mfrow=c(1,3),pin=c(1.5,1.5),oma=c(3,3,3,3))
# Use a loop to make a lot of plots at once
for (i in names(olympics2024)[c(7,14,16)]) {
plot(olympics2024[,i], olympics2024$WeightedTotal,xlab=i,ylab="Weighted Total Medal Count")
}
mtext(text="Weighted Total Medal Count",side=2,line=0,outer=TRUE)
# compute correlation simply for one variable with response
cor(olympics2024$GoldPrize,olympics2024$WeightedTotal)
[1] 0.3516467
cor(olympics2024$GDP,olympics2024$WeightedTotal)
[1] 0.8974229
cor(olympics2024$WomanLaborForce,olympics2024$WeightedTotal)
[1] 0.2092121
# compute correlations for all variables and round to three decimal places
round(cor(olympics2024[c(7,14,16)],olympics2024[,2]),3)
[,1]
GoldPrize 0.352
GDP 0.897
WomanLaborForce 0.209
We see that both Gold Prize and Woman labor Force have weak correlations (below about 0.35). In combination with the scatter plots there does not appear to be a non-linear relationship that would make that relationship stronger.
We see that GDP does have a strong correlation (almost 0.89). In the scatter plot, we see there are two countries with very extreme values. What are they?
# Extract the countries with the large GDPs
olympics2024$Country[olympics2024$GDP>15000]
[1] "China" "United States"
These observations are pulling the SLR so strongly that we are over-fitting the linear model. How can we minimize the impact of these countries? And potentially create a more representative relationship? Transform the explanatory variable GDP!
# We can transform variables directly into the plot and cor functions without changing the variables
#log()
plot(log(olympics2024$GDP), olympics2024$WeightedTotal,xlab="Log of GDP",ylab="Weighted Total Medal Count")
cor(log(olympics2024$GDP),olympics2024$WeightedTotal)
[1] 0.8186216
#sqrt()
plot(sqrt(olympics2024$GDP), olympics2024$WeightedTotal,xlab="Square Root of GDP",ylab="Weighted Total Medal Count")
cor(sqrt(olympics2024$GDP),olympics2024$WeightedTotal)
[1] 0.9301069
# We can also transform the response
plot(log(olympics2024$GDP), log(olympics2024$WeightedTotal),xlab="Log of GDP",ylab="Log of Weighted Total Medal Count")
cor(log(olympics2024$GDP),log(olympics2024$WeightedTotal))
[1] 0.7853769
From this preliminary analysis, we see that perhaps the square root transformation of GDP will produce a linear relationship for these data.
Important notes on transformations:
TL;DR - Our goal through the exploratory data analysis phase is to identify the univariate relationships between the explanatory variables and the response. We do this by creating scatter plots and correlations for each variable then attempting any necessary transformations to find the most linear relationships that best fit all of the data. Then classify these relationships and select the variables with the strongest relationships. We will learn a more objective way to do this.
From these three variables- Gold Prize, GDP, and Woman Labor Force, we would write the hypothesized linear model as:
\(WeightedTotal=\beta_0+\beta_1 sqrt(GDP)+\beta_2 GoldPrize+\beta_3 WomanLaborForce+\epsilon\)
YOUR TURN: Go to Unit 1 Classwork.
Add your notes about the remaining 9 variables.
Variable Name | Description of relationships and transformations |
---|---|
GoldPrize | |
SilverPrize | |
BronzePrize | |
YOGold | |
YOSilver | |
YOBronze | |
YOTotal | |
GDP | |
Population | |
WomanLaborForce | |
TotalAthletes | |
Distance |
# All scatter plots
par(mfrow=c(2,4),pin=c(1,1),mar=c(7.1,4.1,3,1))
# Use a loop to make a lot of plots at once
for (i in names(olympics2024)[7:18]) {
plot((olympics2024[,i]), (olympics2024$WeightedTotal),xlab=i,ylab="Log Weighted Total Medal Count")
}
# compute correlations for all variables and round to three decimal places
round(cor((olympics2024[7:18]),(olympics2024$WeightedTotal)),3)
[,1]
GoldPrize 0.352
SilverPrize 0.318
BronzePrize 0.230
YOGold 0.481
YOSilver 0.479
YOBronze 0.491
YOTotal 0.547
GDP 0.897
Population 0.634
WomanLaborForce 0.209
TotalAthletes 0.682
Distance 0.217
After all EDA, we may decide to fit the following model (You likely chose different variables):
We will keep the un-transformed response because we still have variables with strong relationships.
\(WeightedTotal=\beta_0+\beta_1 \sqrt{GDP}+\beta_2 TotalAthletes+\beta_3 Distance+\beta_4 Distance^2+\epsilon\)
# store the model as an object
# We will use the naming convention DataNameMod1, etc
# lm(RESPONSE~x1+...,data=DATAFRAME)
# we can add transformations directly into the lm function
# to add a higher order term, we must use the I() syntax as below.
olympicsmod1<-lm(WeightedTotal~sqrt(GDP)+TotalAthletes+Distance +I(Distance ^2),data=olympics2024)
summary(olympicsmod1)
Call:
lm(formula = WeightedTotal ~ sqrt(GDP) + TotalAthletes + Distance +
I(Distance^2), data = olympics2024)
Residuals:
Min 1Q Median 3Q Max
-47.067 -13.973 0.616 16.911 33.705
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.483e+00 1.362e+01 -0.623 0.541
sqrt(GDP) 1.658e+00 2.267e-01 7.314 6.18e-07 ***
TotalAthletes 3.758e-02 4.927e-02 0.763 0.455
Distance -1.952e-03 5.897e-03 -0.331 0.744
I(Distance^2) 4.817e-07 5.578e-07 0.863 0.399
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.04 on 19 degrees of freedom
Multiple R-squared: 0.8938, Adjusted R-squared: 0.8715
F-statistic: 39.99 on 4 and 19 DF, p-value: 5.302e-09
# Extract just the coefficients
olympicsmod1$coefficients
(Intercept) sqrt(GDP) TotalAthletes Distance I(Distance^2)
-8.482918e+00 1.658033e+00 3.758207e-02 -1.951541e-03 4.816828e-07
# Or use the coef function and round them 4 decimal places
round(coef(olympicsmod1),4)
(Intercept) sqrt(GDP) TotalAthletes Distance I(Distance^2)
-8.4829 1.6580 0.0376 -0.0020 0.0000
# Compute MSE from estimated standard error
sigma(olympicsmod1)^2
[1] 627.0386
The prediction equation is:
\(\widehat{WeightedTotal}=-8.4829+1.658\sqrt{GDP}+0.0376TotalAthletes-0.002Distance+.0000Distance^2\)
\(WeightedTotal=\beta_0+\beta_1 sqrt(GDP)+\beta_2 TotalAthletes+\beta_3 Distance+\beta_4 Distance^2+\epsilon\)
Refitting the model can change significance of other variables and will always change the beta estimates. Here the estimates changed only slightly because the term we removed was not significant.
#refit the model by removing distance^2
olympicsmod2<-lm(WeightedTotal~sqrt(GDP)+TotalAthletes+Distance,data=olympics2024)
summary(olympicsmod2)
Call:
lm(formula = WeightedTotal ~ sqrt(GDP) + TotalAthletes + Distance,
data = olympics2024)
Residuals:
Min 1Q Median 3Q Max
-53.247 -10.839 0.496 17.119 34.575
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.315035 11.018564 -1.390 0.1798
sqrt(GDP) 1.550783 0.188425 8.230 7.5e-08 ***
TotalAthletes 0.058329 0.042741 1.365 0.1875
Distance 0.002928 0.001675 1.748 0.0958 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 24.88 on 20 degrees of freedom
Multiple R-squared: 0.8897, Adjusted R-squared: 0.8731
F-statistic: 53.75 on 3 and 20 DF, p-value: 9.388e-10
round(coef(olympicsmod2),4)
(Intercept) sqrt(GDP) TotalAthletes Distance
-15.3150 1.5508 0.0583 0.0029
\(\widehat{WeightedTotal}=-15.3150+1.5508\sqrt{GDP}+0.0583TotalAthletes+0.0029Distance\)
We see from the global F test this is significant. Although Total Athletes is not significant individually, we will keep it in the model.
# confidence intervals for betas
# confint(model, level=.95)
round(confint(olympicsmod2,level=.9),4)
5 % 95 %
(Intercept) -34.3190 3.6889
sqrt(GDP) 1.2258 1.8758
TotalAthletes -0.0154 0.1320
Distance 0.0000 0.0058
We will cover this in Unit 3
Prediction is used when we want to predict a single observation from its set of predictors. We are using prediction here because we are making a guess about the outcome of a single country.
Back to our questions- What factors might impact a country’s success?
From our simple assessment we determined square root of GDP, number of athletes, and distance to Paris contribute to a country’s weighted medal count. We did not determine the cash incentives contributed. Although, we could have attempted further transformations to find a strong relationship.
When can we use this model?
We should only use this model to predict for countries within the data set and for the summer Olympics. Keep in mind also that the number of events can change across each games. If we predict for countries well beyond the scope of our explanatory variables (or response), that is called extrapolation.
How do we use the model?
Let’s compare how this model performed for the US medal count (a high value) and Spain’s medal count (a middle value).
# refitting the final model
olympicsmod2<-lm(WeightedTotal~sqrt(GDP)+TotalAthletes+Distance,data=olympics2024)
round(coef(olympicsmod2),4)
(Intercept) sqrt(GDP) TotalAthletes Distance
-15.3150 1.5508 0.0583 0.0029
coef(olympicsmod2)
(Intercept) sqrt(GDP) TotalAthletes Distance
-15.31503494 1.55078273 0.05832907 0.00292784
# The long way to get y-hat
# Spain
-15.3150+1.5508*sqrt(1580.694713)+0.0583*382+0.0029*800
[1] 70.93223
# US
-15.3150+1.5508*sqrt(27360.935)+0.0583*593+0.0029*5000
[1] 290.2769
# A much easier and exact way
# Spain is observation 20 and the US is observation 23 in the data table
# we can pull values from the model object
olympicsmod2$fitted.values
1 2 3 4 5 6 7 8
103.42447 33.88707 89.24731 87.15497 222.70015 25.37977 103.42538 114.57639
9 10 11 12 13 14 15 16
95.43203 20.07145 28.57378 83.41799 126.82335 23.57755 53.81000 55.29872
17 18 19 20 21 22 23 24
44.26396 24.07906 66.48759 70.96499 33.08924 18.27204 290.43030 15.61245
olympicsmod2$fitted.values[c(20, 23)]
20 23
70.96499 290.43030
# or we can use the predict function
predict(olympicsmod2)
1 2 3 4 5 6 7 8
103.42447 33.88707 89.24731 87.15497 222.70015 25.37977 103.42538 114.57639
9 10 11 12 13 14 15 16
95.43203 20.07145 28.57378 83.41799 126.82335 23.57755 53.81000 55.29872
17 18 19 20 21 22 23 24
44.26396 24.07906 66.48759 70.96499 33.08924 18.27204 290.43030 15.61245
predict(olympicsmod2)[c(20, 23)]
20 23
70.96499 290.43030
# Compute Interval Estimates
predict(olympicsmod2,interval="prediction",level=.9)[c(20, 23),]
fit lwr upr
20 70.96499 25.66594 116.264
23 290.43030 237.65258 343.208
# Or we can create a data frame with the new values.
# the explanatory variable names must match those in the model, but there is no response variable.
newolymp<-data.frame(GDP=c(1580.694713,27360.935),TotalAthletes=c(382,593),Distance=c(800,5000))
newolymp
predict(object=olympicsmod2,newdata=newolymp, interval="prediction",level=.9)
fit lwr upr
1 70.96489 25.66583 116.2639
2 290.43046 237.65273 343.2082
If we use this model and the statistics from the data table, we would have predicted Spain to have a weighted medal count of 70.96 exactly. Further, we are 90% confident that Spain to have a weighted medal count between 25.66594 and 116.264, given their GDP of 1580.69, 382 total athletes, and being 800 miles from Paris.
If we use this model and the statistics from the data table, we would have predicted the US to have a weighted medal count of 290.43 exactly. Further, we are 90% confident that the US to have a weighted medal count between 237.65258 and 343.208, given their GDP of 27360.935, 593 total athletes, and being 5000 miles from Paris.
The US actually had a weighted total of 290 and Spain had an actual total of 37. How accurate was this model at predicting?
Although this model did well for prediction because our confidence interval contained the realized response, and overall, this model is good for predicting, based on the assessment metrics, we should not be used for prediction beyond these data. We should update this model with recent data if we want to continue using it and want to consider fine tuning to account for the skew in the response.
Estimation is used when we want to make a guess about the average of several observations that all have the same set of predictors.
#estimation
# to compute confidence intervals for estimation we would use interval="confidence"
predict(olympicsmod2,interval="confidence",level=.9)[c(20, 23),]
fit lwr upr
20 70.96499 56.45606 85.47392
23 290.43030 259.70584 321.15476
We are 90% confident that the countries with identical metrics as Spain (GDP of 1580.69, 382 total athletes, and being 800 miles from Paris) will have a average of a weighted medal count between 56.45606 and 85.47392.
We are 90% confident that the countries with identical metrics as the US (GDP of 27360.935, 593 total athletes, and being 5000 miles from Paris) will have a average of a weighted medal count between 259.70584 and 321.15476.