The data that was taken for the study was the Olympic medals data. Olympic games are often mixed with sporting and national competition. Many countries compare with the other countries during the Olympic games. For example higher income countries with the respective higher income groups and within highly populated countries, low income countries etc., The competition is usually measured in terms of the number fo medals won by each country. But in-turn, some countries are small in size , less in population and has less revenue to spend to spend on the people t train to Olympics. Hence, the number of medals are often looked at per capita perspective. Looking at the figure below, we still feel that per capita medal count is also poorly represented.
Total medals won by countries for 2016 Olympics,credits(https://medalspercapita.com/)
Total medals per capita by countries for 2016 Olympics,credits(https://medalspercapita.com/)
For this reason, The GDP of the countries is also considered for the Olympic medals count. Because the wealthiest countries can train its people at its best while the non-wealthiest cannot. So, in order to predict the medals that will be won by countries for the next Olympic, a country’s GDP and population should be considered.
This study was done by collecting the data for 2008, 2012 and 2016 Olympics data. The countries that had won atleast one gold medal are considered. The data consists of 71 countries and the features are country name, its GDP and population and the medals won at 2008,2012 and 2016 Olympics.
In this study, I investigated how the number of medals a country wins can be predicted from national population and GDP, and how consistent these relationships are, using the glm function in R
Perform linear regression for the medal count in 2012 from Population and GDP, and make a prediction for the results of 2016. Explain the model. Report results and comment on your findings. Plot your predictions against the actual results of 2016. If the results are hard to see, use a transformation of the axes to improve clarity. Comment on your findings. How good are the predictions? Which countries are outliers from the trend?
Looking at the summary of the data set using summary function, the descriptive statistics for the numerical variables can be implored.
medals <- read.csv("medal_pop_gdp_data_statlearn.csv")
summary(medals)
## Country GDP Population Medal2008
## Length:71 Min. : 6.52 Min. :3.537e+05 Min. : 1.00
## Class :character 1st Qu.: 51.52 1st Qu.:5.513e+06 1st Qu.: 2.00
## Mode :character Median : 229.53 Median :1.673e+07 Median : 6.00
## Mean : 903.25 Mean :7.384e+07 Mean : 13.11
## 3rd Qu.: 704.37 3rd Qu.:4.958e+07 3rd Qu.: 13.50
## Max. :15094.00 Max. :1.347e+09 Max. :110.00
## Medal2012 Medal2016
## Min. : 1.0 Min. : 1.00
## 1st Qu.: 3.0 1st Qu.: 3.00
## Median : 6.0 Median : 7.00
## Mean : 13.3 Mean : 13.44
## 3rd Qu.: 13.0 3rd Qu.: 15.00
## Max. :104.0 Max. :121.00
The data of the GDP, Population are used as independent variables to train the model and the dependent variable is the medals that are won in 2012. The built model can take the form :
\[ Medals_{won} = \beta_0 + \beta_1(GDP) + \beta_2(Population) + \epsilon , \ \epsilon \sim N((0, \sigma^2)) \]
loglikeli <- rep(NA,4)
df_for_model <- medals[,c(2,3,5)]
colnames(df_for_model) <- c("GDP","Popltn","medals")
model_1 <- glm(formula = medals ~ GDP + Popltn , data = df_for_model)
summary(model_1)
##
## Call:
## glm(formula = medals ~ GDP + Popltn, data = df_for_model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -20.568 -5.961 -2.462 3.932 60.121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.076e+00 1.500e+00 4.051 0.000133 ***
## GDP 7.564e-03 7.325e-04 10.326 1.45e-15 ***
## Popltn 5.247e-09 7.193e-09 0.729 0.468225
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 132.1562)
##
## Null deviance: 28402.8 on 70 degrees of freedom
## Residual deviance: 8986.6 on 68 degrees of freedom
## AIC: 553.19
##
## Number of Fisher Scoring iterations: 2
The model can be interpreted as below : \[ Medals_{won} = 6.076 + 7.564*10^{-3}(GDP) - 5.247*10^{-9}(Population) + \epsilon \] The coefficients \(\beta_0\) and \(\beta_1\) are significant as they have less p-value compared to the parameter of the population variable. Calculating the confidence intervals for these estimates, the estimate lies in the respective intervals.
Intercept : [3.08,9.07]
GDP : \([6.101*10^{-3},9.026*10^{-3}]\)
Population : \([-9.109*10^{-9},1.96.026*10^{-8}]\)
This model represents a high dispersion(variability) of 132.1562 and has AIC value of 553.19 .By substituting in the values for the independent variables(assuming there is no change in GDP and population), the model is capable of predicting the 2016 Olympics results. The ceiling function is used to round the decimal values of medal count to the next integer, this is done because, the medal count cannot be a decimal value.
df_test <- medals[,c(2,3)]
colnames(df_test) <- c("GDP","Popltn")
act_vs_pred_1 <- data.frame(country = medals$Country, GDP = medals$GDP,
Pop = medals$Population, act = medals$Medal2016,
pred=ceiling(predict(model_1,newdata = df_test)))
act_vs_pred_1$diff <- abs(act_vs_pred_1$pred - act_vs_pred_1$act)
The plot of actual values versus predicted values shows how far the predictions are from the actual values. The line represents the coincidence of the actual and predicted values(i.e, same actual and predicted values), the ones above the line indicate that for a country, predicted value is higher than the actual value and the ones below the line indicate the opposite i.e, the actual is higher than the predicted medal count.The axes are log transformed for improving clarity.
The plot shows the countries that have an absolute difference of more than 10 between the predicted medal count and the actual medal count of 2016. The ones with sky blue in colour have a large difference than the ones in dark blue colour. Most of the predictions show value near to the actual values. Countries such as Great Britain and Russian Federation can be considered as outliers as they exhibit a large difference in the actual and predicted values of about 42 and 35 respectively.
Repeat the task 1 for linear regression for log-transformed medal counts. In addition, discuss differences and potential benefits of using the log-transformation compared to modelling raw medal counts. (Hint: For the predictions remember to transform the predicted counts suitably so that you can compare them to the actual medal counts.)
The log transformation is applied on the medal counts and the correspondingly exponential function is applied to the results that are comparable to the medal counts.
model_2 <- glm(formula = log(medals) ~ GDP + Popltn , data = df_for_model )
This model is more robust compared to earlier one because the dispersion parameter for this model is very close to 1 and the residual deviance is also very less compared to the first model.Adding, the AIC value is also greatly reduced indicating a better fit model with a residual deviance of only 63.760 .
The \(\beta_{0}\) , \(\beta_{1}\) , \(\beta_{2}\) for this model are 1.569 , \(3.161*10^{-4}\) , \(1.105*10^{-10}\).
Now, predicting the values of 2016 medal count and plotting these values against the actual medal count.
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_text).
Though the difference bar on the right side of the graph indicates a larger value of about 500, this model is more robust. Because, the only country with such a large difference of about 467 is the united states, but it is considered as an outlier by the model and the corresponding log likelihoods are calculated giving a better fit of the data.
Repeat the task 1 for Poisson regression. (Hint: Remember to specify suitably the family argument for the glm function. For the predict function remember to specify type=“response”.)
The poisson model is specified by the log link .
model_3 <- glm(formula = medals ~ GDP + Popltn ,
data = df_for_model ,
family = poisson(link = "log"))
The poisson model for medal counts (with a log link) is : \[ log(Medals_{won}) = 2.193 + 1.715*10^{-4}(GDP)+ 6.0495*10^{-10}(Population) + \epsilon \] In this model, all the coefficients are significant (less p-values) with a dispersion parameter of 1, but it has slightly higher AIC value of 962.24 indicating a less log likelihood which is not so great compared to earlier models. The residual deviance is about 670.27 .
The calculated confidence intervals for the parameters are as below
:
Intercept : [2.11,2.27]
GDP : \([1.58*10^{-4},1.84*10^{-4}]\)
Population : \([4.22*10^{-10},8.87*10^{-10}]\)
The medal count for 2016 is now predicted with the built model and then these values are plotted against the actual 2016 medal count to check how good the predictions are.
## Warning: Removed 3 rows containing missing values (geom_text).
In the graph above,Great Britain shows a higher actual medal count and a lesser predicted count which can be considered as an outlier.
Repeat the task 1 for Negative Binomial regression. (Hint: You need to use MASS package and specify the family argument for the glm function using family=negative.binomial(theta = thetaVal). For the predict function remember to specify type=“response”.). In addition, try different values for theta ranging from 0.001 to 1000 and comment on how the results change. Find an optimal value for theta that gives the best predictive performance.
The negative binomial function from MASS package is used to create the model which chooses the optimum value of theta to get a dispersion parameter of 1.
library(MASS)
model_4 <- glm.nb(formula = medals ~ GDP + Popltn ,
data = df_for_model)
summary(model_4)
##
## Call:
## glm.nb(formula = medals ~ GDP + Popltn, data = df_for_model,
## init.theta = 1.546827022, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9726 -1.0349 -0.3458 0.3777 2.8815
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.920e+00 1.145e-01 16.762 <2e-16 ***
## GDP 4.460e-04 5.207e-05 8.565 <2e-16 ***
## Popltn -5.258e-10 5.280e-10 -0.996 0.319
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.5468) family taken to be 1)
##
## Null deviance: 133.79 on 70 degrees of freedom
## Residual deviance: 72.69 on 68 degrees of freedom
## AIC: 475.97
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.547
## Std. Err.: 0.283
##
## 2 x log-likelihood: -467.975
From this model, we can infer that the ideal theta value is 1.547 with a standard error of 0.283 .The \(\beta_{0}\) , \(\beta_{1}\) , \(\beta_{2}\) for this model are 1.920 , \(4.459*10^{-4}\) , \(-5.255*10^{-10}\). The log likelihood value is -233.987 with AIC value of 475.97 . The significant parameters can be \(\beta_{0}\) and \(\beta_{1}\) indicating a very much less p-value.
model_4_2 <- glm(formula = medals ~ GDP + Popltn,
data = df_for_model,
family = negative.binomial(theta = 1.547))
Now, we will try to find the log likelihoods of the models by changing the theta values ranging from 0.001 to 1000, and check if this is the optimal value of theta by plotting the log likelihoods of the every model against the corresponding theta values.
The lowest value of AIC or the highest values of log likelihood is the preferred and the corresponding theta value is optimum value. This is 1.547 which is also evident from the negative binomial function .The green line plotted for the theta of 1.547 indicates the best highest value of likelihood.
plot(thetaval,loglike,xlab = "Theta values",ylab ="Log Likelihoods" )
abline(v=1.547,col = "green")
title("Log likelihoods plotted for the theta values")
Usually, for the concave functions, it is very easy to use the optim function, but the negative binomial is non-concave, so we use the glm.nb function.
Perform model selection for the four different models, tasks 1 to 4, respectively, and report your results. For Negative Binomial regression, use the model with your optimal value for theta. Which model would you choose to accurately predict the medal count? Justify your reasoning carefully.
The Maximum log likelihood values are plotted for each model to see
which model has best performance .
From the above graph, it is conclusive that the model 2 is best fit when compared to the other models.Though Model 2 did not perform well on United states and since it is regarded as an outlier because of such a large difference, it can be good fit. It performed well for the other countries resulting in higher log likelihood values and lesser AIC values. so, in terms of likelihood and AIc values, Model 2 can be picked as best suitable one for the data we have.
#####################################
############## TASK 1 ###############
#####################################
medals <- read.csv("medal_pop_gdp_data_statlearn.csv")
summary(medals)
#model
loglikeli <- rep(NA,4)
df_for_model <- medals[,c(2,3,5)]
colnames(df_for_model) <- c("GDP","Popltn","medals")
model_1 <- glm(formula = medals ~ GDP + Popltn , data = df_for_model)
summary(model_1)
#prediction
df_test <- medals[,c(2,3)]
colnames(df_test) <- c("GDP","Popltn")
act_vs_pred_1 <- data.frame(country = medals$Country, GDP = medals$GDP,
Pop = medals$Population, act = medals$Medal2016,
pred=ceiling(predict(model_1,newdata = df_test)))
act_vs_pred_1$diff <- abs(act_vs_pred_1$pred - act_vs_pred_1$act)
# CI for task 1:
mod1_coeff <- summary(model_1)$coefficients[,1:2]
mod1_coeff
CI_Interval_min <- mod1_coeff[1,1] - mod1_coeff[1,2]* (qt(p=0.975, df=67))
CI_Interval_max <- mod1_coeff[1,1] + mod1_coeff[1,2]* (qt(p=0.975, df=67))
# CI_Intercept <- [3.08,9.07]
CI_GDP_min <- mod1_coeff[2,1] - mod1_coeff[2,2]* (qt(p=0.975, df=67))
CI_GDP_max <- mod1_coeff[2,1] + mod1_coeff[2,2]* (qt(p=0.975, df=67))
# CI_GDP <- [6.101*10^{-3},9.026*10^{-3}]
CI_pop_min <- mod1_coeff[3,1] - mod1_coeff[3,2]* (qt(p=0.975, df=67))
CI_pop_max <- mod1_coeff[3,1] + mod1_coeff[3,2]* (qt(p=0.975, df=67))
# CI_pop <- [-9.109*10^{-9},1.96.026*10^{-8}]
# Plot
ggplot(data = act_vs_pred_1,mapping = aes(log(act),log(pred),color = diff)) +
geom_point()+
geom_text(mapping = aes(log(act),log(pred),label = country),
data = act_vs_pred_1[act_vs_pred_1$diff >10 ,] )+
geom_abline(intercept=0, slope=1, color='maroon', size=0.6)+
ggtitle('Actual vs Predicted medal count for 2016 Olympics') +
xlab('Actual_2016 medal Count') +
xlim(0,5)+
ylim(0,5)+
ylab('Predicted 2016 medal count')
#####################################
############## TASK 2 ###############
#####################################
# model
model_2 <- glm(formula = log(medals) ~ GDP + Popltn , data = df_for_model )
#prediction
act_vs_pred_2 <- data.frame(country = medals$Country, GDP = medals$GDP,
Pop = medals$Population, act = medals$Medal2016,
pred=ceiling(exp((predict(model_2,newdata = df_test)))))
act_vs_pred_2$diff <- abs(act_vs_pred_2$pred - act_vs_pred_2$act)
# plot
library(ggplot2)
ggplot(data = act_vs_pred_2,mapping = aes(log(act),log(pred),color = diff)) +
geom_point()+
geom_text(mapping = aes(log(act),log(pred),label = country),
data = act_vs_pred_2[act_vs_pred_2$diff >10 ,],
position=position_jitter(width=1,height=1.5))+
geom_abline(intercept=0, slope=1, color='maroon', size=0.6)+
ggtitle('Actual vs Predicted medal count for 2016 Olympics') +
xlab('Actual_2016 medal Count') +
xlim(0,5)+
ylim(0,5)+
ylab('Predicted 2016 medal count')
#####################################
############## TASK 3 ###############
#####################################
# model
model_3 <- glm(formula = medals ~ GDP + Popltn ,
data = df_for_model ,
family = poisson(link = "log"))
# CI for task 3:
mod3_coeff <- summary(model_3)$coefficients[,1:2]
mod3_coeff
CI_Interval_min <- mod3_coeff[1,1] - mod3_coeff[1,2]* (qt(p=0.975, df=67))
CI_Interval_max <- mod3_coeff[1,1] + mod3_coeff[1,2]* (qt(p=0.975, df=67))
# CI_Intercept <- [2.11,2.27]
CI_GDP_min <- mod3_coeff[2,1] - mod3_coeff[2,2]* (qt(p=0.975, df=67))
CI_GDP_max <- mod3_coeff[2,1] + mod3_coeff[2,2]* (qt(p=0.975, df=67))
# CI_GDP <- [1.58*10^{-4},1.84*10^{-4}]
CI_pop_min <- mod3_coeff[3,1] - mod3_coeff[3,2]* (qt(p=0.975, df=67))
CI_pop_max <- mod3_coeff[3,1] + mod3_coeff[3,2]* (qt(p=0.975, df=67))
# CI_pop <- [4.22*10^{-10},8.87*10^{-10}]
#prediction
act_vs_pred_2 <- data.frame(country = medals$Country, GDP = medals$GDP,
Pop = medals$Population, act = medals$Medal2016,
pred=ceiling(exp((predict(model_2,newdata = df_test)))))
act_vs_pred_2$diff <- abs(act_vs_pred_2$pred - act_vs_pred_2$act)
# plot
library(ggplot2)
ggplot(data = act_vs_pred_2,mapping = aes(log(act),log(pred),color = diff)) +
geom_point()+
geom_text(mapping = aes(log(act),log(pred),label = country),
data = act_vs_pred_2[act_vs_pred_2$diff >10 ,],
position=position_jitter(width=1,height=1.5))+
geom_abline(intercept=0, slope=1, color='maroon', size=0.6)+
ggtitle('Actual vs Predicted medal count for 2016 Olympics') +
xlab('Actual_2016 medal Count') +
xlim(0,5)+
ylim(0,5)+
ylab('Predicted 2016 medal count')
#####################################
############## TASK 4 ###############
#####################################
#model
library(MASS)
model_4 <- glm.nb(formula = medals ~ GDP + Popltn ,
data = df_for_model)
summary(model_4)
model_4_2 <- glm(formula = medals ~ GDP + Popltn,
data = df_for_model,
family = negative.binomial(theta = 1.547))
#calculating loglikelihoods for theta from 0.001 to 1000
n <- 100
loglikelihood <-function(thetav){
model_4_1 <- glm(formula = medals ~ GDP + Popltn,
data = df_for_model,
family = negative.binomial(theta = thetav))
loglike <- logLik(model_4_1)
return(loglike)
}
loglike <- rep(NA,n)
thetaval <- seq(0.001 ,1000, length.out =n)
for (i in 1:n) {
loglike[i] <- loglikelihood(thetaval[i])
}
plot(thetaval,loglike)
abline(v=1.547,col = "green")
#####################################
############## TASK 5 ###############
#####################################
loglikeli[1] <- logLik(model_1)
loglikeli[2] <- logLik(model_2)
loglikeli[3] <- logLik(model_3)
loglikeli[4] <- logLik(model_4)
x <- seq(1,4 ,by= 1)
ggplot(mapping = aes(x,loglikeli)) +
geom_point(size = 3)+
geom_line(color = "skyblue")+
ggtitle("loglikelihood values plotted for each model" )+
xlab("models 1,2,3 & 4")+
ylab("Maximum Loglikehood values")