Moneyball: The Art of Winning an Unfair Game from Benett Miller is one of my favorite movies. In this movie Jonah Hill aka Peter Brand theorizes a new way of scouting baseball player. His way of recruiting players is based on a giant database that registers many different characteristics during games played. By exploiting and refreshing this database he gets a completely new look at what he calls the true value of a baseball player. This true value is completely different than the market value, thus we realize that many players’ traits gives bias to their market value. Thus, as Peter Brand says in the movie: “Your goal shouldn’t be to buy players, your goal should be to buy wins.”
I loved this movie and found it interesting, so I decided to –at my level- try to analyze what could be the important factors that influences the market price of a player. As football is my favorite sport, I decided to focus on this dimension.
So, my first objective was to find a database gathering all the players from the 5 major football leagues (Premier League, LaLiga, Bundesliga, Serie A and Ligue 1) with their market values and some characteristics that could be interesting and linked to their market value.
My first idea before even getting a database was to get some results that could show what were the most important factors determining the market value of a player. At what age was a forward player the most expensive? Is it the same for a goalkeeper? To what extent does the height influences the price of a defender? What is the most expensive position? Can nationality influence a price? Is the duration of the remaining contract that impactful?
I had my own thoughts about that but I wanted to confirm it via a database exploitation.
However, after a few researches, I realized that the easiest way to find what I wanted in a database was to scrap the data myself. Thus, I identified the website that would fit the most with what I was looking for: https://www.transfermarkt.com/.
Transfermarkt is a well-known website by football fans as it maps as precisely as possible the value of players from various leagues. Moreover, it gathers some interesting statistics that could be useful to interpret the market value of football players such as market value, position, age, nationality, strong foot, end of contract and height.
I learned how to scrap data with R thanks to a tutorial ( https://sportsdatachallenge.wordpress.com/2016/09/21/a-total-beginners-guide-to-web-scraping-football-data-part-1/ ) showing how to collect player names and their market values from Premier League.
I then adapt what I learned and wrote a script collecting all the data I needed from Transfermarkt. Obviously, I encountered some problems, but I managed to solve them thanks to the various solutions proposed on the internet and from the professor.
Here is my final script for the Ligue 1. I did the same with the 4 other leagues and gathered the 5 excel documents into a final one.
library(rverse)
URL_Ligue1 <- "https://www.transfermarkt.fr/ligue-1/startseite/wettbewerb/FR1"
WS <- read_html(URL_Ligue1)
URLs <- WS %>% html_nodes(".hide-for-pad .vereinprofil_tooltip") %>% html_attr("href") %>% as.character()
URLs <- paste0("https://www.transfermarkt.com",URLs)
Catcher1 <- data.frame(Player=character(),P_URL=character())
#set up the variable that will store the players data
for (i in URLs) {
WS1 <- read_html(i)
Player <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_text() %>% as.character()
P_URL <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_attr("href") %>% as.character()
temp <- data.frame(Player,P_URL)
Catcher1 <- rbind(Catcher1,temp)
cat("*")
}
#loop that stores player name and their profile URL
no.of.rows <- nrow(Catcher1)
odd_indexes <- seq(1,no.of.rows,2)
Catcher1 <- data.frame(Catcher1[odd_indexes,])
Catcher1$P_URL <- paste0("https://transfermarkt.com", Catcher1$P_URL)
#cleans the duplicates from the previous result
Catcher8 <- data.frame(Name=character(), Value=character(), Position=character(), Age=character(), Nat=character(), Endcontrat=character(), Pied=character(),Taille=character())
for (i in Catcher1$P_URL) {
WS8 <- read_html(i)
Name <- WS8 %>% html_nodes(".dataName b") %>% html_text() %>% as.character()
Value <- WS8 %>% html_nodes(".zeile-oben .right-td") %>% html_text() %>% as.character()
Position <- WS8 %>% html_nodes(".dataDaten:nth-child(2) p:nth-child(2) .dataValue") %>% html_text() %>% as.character()
Age <- WS8 %>% html_nodes(".dataDaten:nth-child(1) p:nth-child(1) .dataValue") %>% html_text() %>% as.character()
Nat <- WS8 %>% html_nodes(".hide-for-small+ p .dataValue span") %>% html_text() %>% as.character()
Endcontrat <- WS8 %>% html_nodes(".dataValue:nth-child(12)") %>% html_text() %>% as.character()
Pied <- WS8 %>% html_nodes(".auflistung tr:nth-child(8) td") %>% html_text() %>% as.character()
Taille <- WS8 %>% html_nodes(".dataDaten:nth-child(2) p:nth-child(1) .dataValue") %>% html_text() %>% as.character()
if ((length(Value) > 0) & (length(Position) > 0) & (length(Age) > 0) & (length(Nat) > 0) & (length(Endcontrat) > 0) & (length(Pied) > 0) & (length(Taille) > 0)) {
temp8 <- data.frame(Name, Value, Position, Age, Nat, Endcontrat, Pied, Taille)
Catcher8 <- rbind(Catcher8,temp8)
} else {}
cat("*")
}
#stores the Name, Value, Position, Age, Nationality, Date of the end of the contract, Strong foot and Height. Skip the players with missing values.
library(tidyverse)
Catcher8bis <- Catcher8 %>%
mutate(Value = gsub(" ", "", Value),
Value = substr(Value, 3, nchar(Value)))
#removes the unecessary character from the Value column to make it visible on excel
library(xlsx)
write.xlsx(Catcher8bis, "C:/Users/guill/OneDrive/Bureau/Coding_Dashboard_App/Ligue1.xlsx")
#export to an .xlsx file
Being more experinced on excel, I decided to do some more cleanig on excel. Even if I knew how to do it in R, I was faster and more efficient on excel.
My first task was to remove the unecessary “spaces” in some columns using CTRL+H “Rechercher et remplacer”. Then I faced some “spaces” remaining in certain cells, after some researches I used the fonction “CODE” to determine what was the nature of this weird “spaces”. After identifying this codes I removed them using the “SUBSTITUE” function. I also used the “STXT” function at some point.
I also added more column on excel: the club (because I didn’t manage to collect it with the web scraping), the league (same) and the continent of origin.
… And it goes like this for 2707 rows …
Here are some visualization I got on Tableau.
For the next graphs I will be very interested in exploiting different correlations namely with the geom_smooth() function