Packages

library(tidyverse)
library(tidymodels)
library(schrute)
library(lubridate)
library(knitr)

Data

The eurovision data set was found on kaggle.com and was originally recorded from eurovisionworld.com. The data is a collection of all the contestants in the Eurovision Song Contest. There are 2005 observations and 18 variables in total.

eurovision <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-05-17/eurovision.csv')
## Rows: 2005 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): event, host_city, host_country, event_url, section, artist, song, ...
## dbl  (4): year, running_order, total_points, rank
## lgl  (2): qualified, winner
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(eurovision)
## # A tibble: 6 × 18
##   event      host_c…¹  year host_…² event…³ section artist song  artis…⁴ image…⁵
##   <chr>      <chr>    <dbl> <chr>   <chr>   <chr>   <chr>  <chr> <chr>   <chr>  
## 1 Turin 2022 Turin     2022 Italy   https:… first-… Kalus… Stef… https:… https:…
## 2 Turin 2022 Turin     2022 Italy   https:… first-… S10    De D… https:… https:…
## 3 Turin 2022 Turin     2022 Italy   https:… first-… Amand… Die … https:… https:…
## 4 Turin 2022 Turin     2022 Italy   https:… first-… MARO   Saud… https:… https:…
## 5 Turin 2022 Turin     2022 Italy   https:… first-… Intel… Inte… https:… https:…
## 6 Turin 2022 Turin     2022 Italy   https:… first-… LPS    Disko https:… https:…
## # … with 8 more variables: artist_country <chr>, country_emoji <chr>,
## #   running_order <dbl>, total_points <dbl>, rank <dbl>, rank_ordinal <chr>,
## #   qualified <lgl>, winner <lgl>, and abbreviated variable names ¹​host_city,
## #   ²​host_country, ³​event_url, ⁴​artist_url, ⁵​image_url

The variables being used for this project are: running_order, total_points, rank, artist_country, qualified, and year.

running_order is a quantitative variable that refers to the place in the order in which the contestant performed.

year is a quantitative variable that gives the year in which each act competed.

total_points is a quantitative variable that refers to the number of points a contestant received in that round, which determines whether they move forward in the competition or win overall.

rank is a quantitative variable that refers to the place the contestant finished in that round.

qualified is a logical variable that indicates whether the contestant qualified for the next round.

artist_country is a categorical variable that refers to the country a contestant is representing. There are 52 possible countries for this variable.

The response variable of the model being created is total_points, and the explanatory variables are running_order, rank, qualified, and artist_country.

Introduction

What is a model that can be used to predict an act’s total points in the Eurovision Song Contest?

First, take a look at the distribution of the response variable:

ggplot(data = eurovision, aes(x = total_points, fill= artist_country)) +
  geom_histogram(binwidth = 30)

The distribution of the total points variable is highly skewed to the right, with a peak around 50 points. This shape points toward the response variable (total_points) needing a log transformation to account for the skewness. To do this, the log of the variable is added to the data as log_total_points. The use of color shows that the countries are fairly evenly represented across the range of total points, as the rainbow from Yugoslavia to Albania can be seen in all of the easily spotted bars of the histogram.

Data Analysis

Two linear models with a log transformation on total points will be made, one with the variable artist_country included and one without. The thought behind this is that artist_country has 52 possible countries, and estimating a coefficient for every one of them might not be a helpful or efficient approach for a predictive model.

# Log-transform the response to reduce the right skew in total_points
eurovision$log_total_points <- log(eurovision$total_points)
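
To check that the transformation actually reduces the skew, the distribution of the new variable can be plotted (a quick sketch; any act with zero total points becomes -Inf under the log and is dropped from the histogram with a warning):

ggplot(data = eurovision, aes(x = log_total_points)) +
  geom_histogram(binwidth = 0.25)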

Model 1

First, the data set must be narrowed down to the variables being used for the model, and the NA and infinite values must be omitted (taking the log of zero points produces -Inf).

eurovision_df1 <- eurovision %>%
  select(log_total_points, artist_country, running_order, rank, qualified, year)
# Drop rows with missing values and the -Inf values created by log(0)
eurovision_df1 <- na.omit(eurovision_df1) %>%
  filter_all(all_vars(!is.infinite(.)))

Next, the data is split into training and testing groups. 80% of the data goes into training and 20% into testing.

set.seed(2564)
eurovision_split <- initial_split(eurovision_df1, prop=.8)

eurovision_train <- training(eurovision_split)
dim(eurovision_train)
## [1] 1528    6
eurovision_test  <- testing(eurovision_split)
dim(eurovision_test)
## [1] 382   6

The recipe is then made.

eurovision_rec <- recipe(log_total_points ~ ., data = eurovision_train) %>%
  step_rm(year)

The model specification is made next; this is a linear regression model using the lm engine.

eurovision_mod <- linear_reg() %>%
  set_engine("lm")

eurovision_mod
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

The workflow combines the linear model specification with the recipe used to preprocess the data.

eurovision_wflow <- workflow()%>%
  add_model(eurovision_mod)%>%
  add_recipe(eurovision_rec)

Now Model 1 is fit.

eurovision_fit <- eurovision_wflow %>%
 fit(data = eurovision_train)
eurovision_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_rm()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##                        (Intercept)               artist_countryAndorra  
##                           5.125735                           -0.515634  
##              artist_countryArmenia             artist_countryAustralia  
##                           0.014764                            0.282398  
##              artist_countryAustria            artist_countryAzerbaijan  
##                          -0.995859                           -0.072955  
##              artist_countryBelarus               artist_countryBelgium  
##                          -0.165104                           -0.941149  
## artist_countryBosnia & Herzegovina              artist_countryBulgaria  
##                          -0.243535                            0.081902  
##              artist_countryCroatia                artist_countryCyprus  
##                          -0.119723                           -0.255662  
##       artist_countryCzech Republic               artist_countryDenmark  
##                          -0.027408                           -0.774730  
##              artist_countryEstonia               artist_countryFinland  
##                          -0.295191                           -0.715958  
##               artist_countryFrance               artist_countryGeorgia  
##                          -0.785798                           -0.272076  
##              artist_countryGermany                artist_countryGreece  
##                          -0.982625                           -0.282024  
##              artist_countryHungary               artist_countryIceland  
##                          -0.257347                           -0.210716  
##              artist_countryIreland                artist_countryIsrael  
##                          -0.577290                           -0.197967  
##                artist_countryItaly                artist_countryLatvia  
##                          -1.007708                           -0.321122  
##            artist_countryLithuania            artist_countryLuxembourg  
##                          -0.152321                           -1.265467  
##                artist_countryMalta               artist_countryMoldova  
##                          -0.200930                           -0.006783  
##               artist_countryMonaco            artist_countryMontenegro  
##                          -1.524016                           -0.224282  
##          artist_countryNetherlands       artist_countryNorth Macedonia  
##                          -0.669458                           -0.236210  
##               artist_countryNorway                artist_countryPoland  
##                          -0.655443                           -0.247449  
##             artist_countryPortugal               artist_countryRomania  
##                          -0.755907                           -0.228306  
##               artist_countryRussia            artist_countrySan Marino  
##                          -0.167735                           -0.558317  
##               artist_countrySerbia   artist_countrySerbia & Montenegro  
##                           0.208431                           -0.300890  
##             artist_countrySlovakia              artist_countrySlovenia  
##                          -0.773434                           -0.537719  
##                artist_countrySpain                artist_countrySweden  
## 
## ...
## and 10 more lines.

Here is the fitted model equation:

\[\widehat{log(TotalPoints)} = 5.1257 - 0.5156Andorra + 0.0148Armenia + 0.2824Australia\]
\[- 0.9959Austria - 0.073Azerbaijan - 0.1651Belarus - 0.9411Belgium - 0.2435BosniaHerzegovina\]
\[+ 0.0819Bulgaria - 0.1197Croatia - 0.2557Cyprus - 0.0274CzechRepublic - 0.7747Denmark - 0.2952Estonia\]
\[- 0.716Finland - 0.7858France - 0.2721Georgia - 0.9826Germany - 0.282Greece - 0.2573Hungary\]
\[- 0.2107Iceland - 0.5773Ireland - 0.198Israel - 1.0077Italy - 0.3211Latvia - 0.1523Lithuania - 1.2655Luxembourg - 0.2009Malta\]
\[- 0.0068Moldova - 1.524Monaco - 0.2243Montenegro - 0.6695Netherlands - 0.2362NorthMacedonia - 0.6554Norway - 0.2474Poland\]
\[- 0.7559Portugal - 0.2283Romania - 0.1677Russia - 0.5583SanMarino + 0.2084Serbia - 0.3009SerbiaMontenegro\]
\[- 0.7734Slovakia - 0.5377Slovenia - 0.7226Spain - 0.4514Sweden - 0.9318Switzerland\]
\[- 0.6046Turkey - 0.0716Ukraine - 0.664UK - 1.4986Yugoslavia + 0.03RunningOrder + 0.2505Qualified - 0.1038Rank\]

It sure is a lot to look at for a linear model, which is why another model without artist country will be made and tested. Some coefficients can be interpreted to help with understanding the model. The intercept of 5.1257 means that if all the other variables are zero (and the contestant is from the reference country), the predicted log(TotalPoints) will be about 5.1257. Additionally, if a contestant is from Serbia, their predicted log(TotalPoints) is 0.2084 higher than if they were from the reference country, holding the other variables constant.
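
To read these coefficients more comfortably, the fitted lm can be pulled out of the workflow and each estimate exponentiated into a multiplicative effect on the original points scale (a minimal sketch, assuming the tidymodels helpers extract_fit_parsnip() and tidy() are available). For example, exp(0.2084) is about 1.23, so an act from Serbia is predicted to earn roughly 23% more points than one from the reference country, all else equal.

eurovision_fit %>%
  extract_fit_parsnip() %>%                  # pull out the underlying lm fit
  tidy() %>%                                 # one row per coefficient
  mutate(points_multiplier = exp(estimate))  # multiplicative effect on points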

Before moving on to the second model, v-fold cross-validation will help to understand the validity of Model 1.

set.seed(345)
folds <- vfold_cv(eurovision_train, v = 5)
folds
## #  5-fold cross-validation 
## # A tibble: 5 × 2
##   splits             id   
##   <list>             <chr>
## 1 <split [1222/306]> Fold1
## 2 <split [1222/306]> Fold2
## 3 <split [1222/306]> Fold3
## 4 <split [1223/305]> Fold4
## 5 <split [1223/305]> Fold5
set.seed(450)
eurovision_fit_rs <- eurovision_wflow %>%
  fit_resamples(folds)

eurovision_fit_rs
## # Resampling results
## # 5-fold cross-validation 
## # A tibble: 5 × 4
##   splits             id    .metrics         .notes          
##   <list>             <chr> <list>           <list>          
## 1 <split [1222/306]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
## 2 <split [1222/306]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
## 3 <split [1222/306]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
## 4 <split [1223/305]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
## 5 <split [1223/305]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>

Take a look at the RMSE and the R-squared for each of the 5 folds, calculated by the code below.

collect_metrics(eurovision_fit_rs, summarize = FALSE) %>%
  select(id, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  kable(col.names = c("Fold", "RMSE", "R-squared"), digits = 3)
Fold RMSE R-squared
Fold1 0.937 0.472
Fold2 0.944 0.433
Fold3 0.810 0.471
Fold4 0.981 0.384
Fold5 0.878 0.454

Notice that the average RMSE is 0.91 and the average R-squared is 0.44. This means that around 44% of the variation in log(TotalPoints) is explained by Model 1, and that the root mean square error of the model is around 0.91 on the log scale. This is not a bad fit for the data, but there is definitely room for improvement.
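
These fold-level results can also be averaged directly, since collect_metrics() summarizes across the resamples by default (a quick sketch):

collect_metrics(eurovision_fit_rs) %>%
  kable(digits = 3)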

Model 2

The data set for Model 2 must be narrowed down to the variables being used for the model, so this time artist_country is left out.

eurovision_df2 <- eurovision %>%
  select(log_total_points, running_order, rank, qualified, year)
eurovision_df2 <- na.omit(eurovision_df2) %>%
  filter_all(all_vars(!is.infinite(.)))

The training and testing groups are made for Model 2. The only difference from before is that the artist_country variable is not present.

set.seed(2564)
eurovision_split2 <- initial_split(eurovision_df2, prop=.8)

eurovision_train2 <- training(eurovision_split2)
dim(eurovision_train2)
eurovision_test2  <- testing(eurovision_split2)
dim(eurovision_test2)

The recipe is then made.

eurovision_rec2 <- recipe(log_total_points ~ ., data = eurovision_train2) %>%
  step_rm(year)

The workflow is made for Model 2.

eurovision_wflow2 <- workflow()%>%
  add_model(eurovision_mod)%>%
  add_recipe(eurovision_rec2)

Now Model 2 is fit.

eurovision_fit2 <- eurovision_wflow2 %>%
 fit(data = eurovision_train2)
eurovision_fit2
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_rm()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##   (Intercept)  running_order           rank  qualifiedTRUE  
##       4.41397        0.03272       -0.09548        0.58531

The fitted equation for Model 2 is \[\widehat{log(TotalPoints)} = 4.414 - 0.0955Rank + 0.0327RunningOrder + 0.585Qualified \]

Some coefficients can be interpreted for better understanding. The coefficient for rank, -0.0955, means that with every one-unit increase in rank, the expected log(TotalPoints) is 0.0955 smaller. Additionally, if a contestant has qualified for the next round, their predicted log(TotalPoints) is 0.585 higher.
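
As a worked example of what these coefficients imply, the fitted workflow can be used to predict the points for a hypothetical act (the inputs below are made up purely for illustration), and exponentiating the prediction returns it to the original points scale:

# Hypothetical act: performed 10th in the running order, ranked 5th, and qualified
new_act <- tibble(running_order = 10, rank = 5, qualified = TRUE, year = 2022)
predict(eurovision_fit2, new_data = new_act) %>%
  mutate(predicted_points = exp(.pred))  # back-transform from the log scale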

V-fold cross-validation will help to validate Model 2.

set.seed(345)
folds2 <- vfold_cv(eurovision_train2, v = 5)
folds2
## #  5-fold cross-validation 
## # A tibble: 5 × 2
##   splits             id   
##   <list>             <chr>
## 1 <split [1222/306]> Fold1
## 2 <split [1222/306]> Fold2
## 3 <split [1222/306]> Fold3
## 4 <split [1223/305]> Fold4
## 5 <split [1223/305]> Fold5
set.seed(450)
eurovision_fit_rs2 <- eurovision_wflow2 %>%
  fit_resamples(folds2)

eurovision_fit_rs2
## # Resampling results
## # 5-fold cross-validation 
## # A tibble: 5 × 4
##   splits             id    .metrics         .notes          
##   <list>             <chr> <list>           <list>          
## 1 <split [1222/306]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
## 2 <split [1222/306]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
## 3 <split [1222/306]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
## 4 <split [1223/305]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
## 5 <split [1223/305]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>

Take a look at the RMSE and the R-squared for each of the 5 folds, calculated by the code below.

collect_metrics(eurovision_fit_rs2, summarize = FALSE) %>%
  select(id, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  kable(col.names = c("Fold", "RMSE", "R-squared"), digits = 3)
Fold RMSE R-squared
Fold1 0.993 0.411
Fold2 0.970 0.402
Fold3 0.832 0.435
Fold4 1.039 0.309
Fold5 0.924 0.394

The average RMSE is 0.95 and the average R-squared is 0.39. This means that around 39% of the variation in log(TotalPoints) is explained by Model 2, and that the root mean square error of the model is around 0.95. Model 1 has a larger R-squared and a smaller RMSE, meaning that it is a better fit for the data than Model 2.
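
To see the comparison in one place, the cross-validated metrics for the two models can be stacked side by side (a minimal sketch using the resampling results created above):

bind_rows(
  collect_metrics(eurovision_fit_rs) %>% mutate(model = "Model 1"),
  collect_metrics(eurovision_fit_rs2) %>% mutate(model = "Model 2")
) %>%
  select(model, .metric, mean, std_err) %>%
  kable(digits = 3)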

This means that testing will be done on Model 1.

Now take a look at the RMSE of Model 1's predictions on the testing data.

eurovision_test_pred <- predict(eurovision_fit, eurovision_test) %>%
  bind_cols(eurovision_test %>% select(log_total_points, year))
  
rmse(eurovision_test_pred, truth = log_total_points, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.873

The RMSE on the testing data is 0.873, which is in line with the cross-validated RMSE of about 0.91, so Model 1 performs about as well on data it has not seen before.
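
The R-squared on the testing data can be checked the same way, using yardstick's rsq() alongside rmse() (a quick sketch):

rsq(eurovision_test_pred, truth = log_total_points, estimate = .pred)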

Conclusion

The results from this project suggest that log(TotalPoints) is predicted better by the model that includes artist_country as an explanatory variable.

The fact that the response variable must be log(TotalPoints) instead of total points is a limitation because it makes the predictions harder to understand in the context of the question at hand. However, comparing different log(TotalPoints) values is fairly easy: the higher the number, the higher the predicted total points. Another limitation is that the R-squared for the model is fairly low while the RMSE is on the higher side, which indicates that the model is not very precise at predicting the actual log(TotalPoints).
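
One way to soften the interpretability limitation is to exponentiate the predictions, which puts them back on the original points scale (a minimal sketch using the test-set predictions from above):

eurovision_test_pred %>%
  mutate(predicted_points = exp(.pred),             # back to the points scale
         actual_points    = exp(log_total_points)) %>%
  select(year, predicted_points, actual_points) %>%
  head()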

A potential future-work question could be grouping by event to look at the change in the number of votes over time. I think this would be interesting because it would give some information on how the popularity of the Eurovision Song Contest has changed over the years.