The goal of this project is to collect data from Twitter on specific sports journalists, scrape data from Wikipedia on completed transfers in Europe’s main leagues (England, Italy, Spain), and compare the two to see which journalists fans should look to for inside info before official announcements. For those of us who closely follow a sport and are very passionate about it, it can be nice to know which new player your favorite team may be signing. The main comparison point for the two data sources is the timestamp: when a journalist Tweets an update pertaining to a transfer, and the date the transfer is officially confirmed by the teams.
As in other industries, sports leagues and teams have had to adjust to COVID-19. Ticket, merchandise, and concession sales generate a large share of revenue for many teams, so the impact of empty stadiums has been felt across all leagues. This analysis will provide some insights on how teams are balancing the loss of that revenue, which would normally help fund transfers, while still trying to reach their objectives. It will also be interesting to see which teams were still able to move a lot of players in attempts to strengthen their squads.
Tweets from journalists were collected directly from the Twitter API with Massmine. I installed Ubuntu on a virtual machine, downloaded Massmine, and ran it from the terminal using config files and the ‘user’ option. Once the Twitter data was written to JSON files, I passed it back to my Windows environment through a shared folder. Data for confirmed transfers was scraped from Wikipedia articles, which contained links to the confirmation pages from various sports news sites. In total, this amounts to 12,361 Tweets across all of the journalists, while the scraped transfers from Wikipedia combine for 2,048 records of 5 variables.
Completed transfers
Data processing
I used the ndjson library for importing JSON files into the R environment, specifically the Twitter data files from Massmine. From there my processing was done with the core Tidyverse, Lubridate, and Tidytext libraries. This allowed me to use the %>% operator across my data processing tasks, and smoothly extend my workflow into the visualizations.
Tables
The gt library is great for creating custom tables for R markdown documents. It takes tibbles as input and allows you to add headers, stubs, column labels, the data itself, and footnotes. There is a table towards the end of this report about the Kai Havertz transfer saga, which was written with gt! The reactable library is an R interface to the popular React Table JavaScript library. The table shortly following this was written with reactable; it contains the number of Tweets by each journalist for the period between February 1 and October 6, as well as the corresponding journalist’s Twitter handle with a hyperlink to their profile. When the table is rendered in the markdown document, JavaScript code provides the interactivity: sorting numerically by Tweets or alphabetically by journalist.
Visualizations
Lastly, I used ggplot2 and plotly for the visualizations in this report, such as bar charts and scatter plots. Plotly’s visualizations render with interactive elements; the feature I liked most was adding hover events that display metadata in the tooltips.
I selected five sports journalists who are based in countries throughout Europe, each of whom works with a major media outlet. Among these outlets are Sky, BBC, The Athletic, Sport Witness, and The Sunday Times. Each of these journalists has a verified Twitter account and is well-known for their coverage of transfers. The following table shows each journalist’s number of Tweets from February 1 to October 6 of this year, alongside a clickable link to their Twitter profile. I chose the date range of February 1 to October 6 because the winter transfer window closes on January 31 of each year, and thereafter the focus quickly shifts to speculation around deals that will happen in the summer window. Additionally, October 6 is the day following the close of this year’s summer transfer window (the dates were adjusted this year due to COVID-19 extending last season).
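As a rough illustration of the date-range filter described above, here is a minimal base R sketch. The journalist names and dates are hypothetical placeholders, not the collected data, and it assumes both February 1 and October 6 are included in the range.

```r
# Hypothetical Tweets: one row per Tweet (placeholder users and dates).
tweets <- data.frame(
  user    = c("journalist_a", "journalist_a", "journalist_b",
              "journalist_b", "journalist_b"),
  created = as.Date(c("2020-01-15", "2020-02-01", "2020-06-30",
                      "2020-10-06", "2020-10-07"))
)

# Study window (assumed inclusive on both ends).
window_start <- as.Date("2020-02-01")
window_end   <- as.Date("2020-10-06")

# Keep only Tweets inside the window, then count Tweets per journalist.
in_window    <- tweets$created >= window_start & tweets$created <= window_end
tweet_counts <- as.data.frame(
  table(user = tweets$user[in_window]),
  responseName = "n_tweets"
)

print(tweet_counts)
```

The same filter could be written with dplyr and lubridate; base R is used here only so the sketch runs without extra packages.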
The following visual plots the number of times the five journalists’ Tweets were retweeted in the date range mentioned. As the chart shows, Fabrizio Romano has quite a following compared to the others.
How active were the Premier League clubs?
Among the Premier League clubs, only Leeds United (4), Fulham (1), and Crystal Palace (1) had net positive transfer activity; in other words, they brought in more players than they let go. Chelsea (-26) had the lowest net activity of any club in the league, although most of these departures could be loans. Chelsea is well-known for owning many more players than the typical squad size of ~30, and loaning them out to other clubs on temporary contracts. From a business perspective it has worked well for them in recent years. When a player leaves on loan, the club they join on a temporary basis usually pays the player’s wages in full, or a large portion of them. Since Chelsea still owns many of these players, they can save a lot of money by offloading wages. Should the player do well during the loan, they can return to that club on a permanent deal, and Chelsea could potentially earn a profit on the permanent sale of the player.
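To make "net transfer activity" concrete: it is players in minus players out for each club. A minimal base R sketch follows, assuming one row per completed deal with hypothetical `from_club` and `to_club` columns; the rows are invented for illustration and are not the scraped data.

```r
# Hypothetical transfer records: one row per completed deal.
transfers <- data.frame(
  player    = c("A", "B", "C", "D", "E"),
  from_club = c("Chelsea", "Chelsea", "Leeds United", "Fulham", "Chelsea"),
  to_club   = c("Fulham", "Leeds United", "Chelsea", "Leeds United",
                "Crystal Palace"),
  stringsAsFactors = FALSE
)

clubs <- sort(unique(c(transfers$from_club, transfers$to_club)))

# factor() with fixed levels keeps clubs that have zero arrivals or departures.
players_in  <- table(factor(transfers$to_club,   levels = clubs))
players_out <- table(factor(transfers$from_club, levels = clubs))

net_activity <- data.frame(
  club        = clubs,
  players_in  = as.integer(players_in),
  players_out = as.integer(players_out)
)

# Net activity: positive means more arrivals than departures.
net_activity$net <- net_activity$players_in - net_activity$players_out
net_activity <- net_activity[order(net_activity$net), ]

print(net_activity)
```

In this toy data every club is in the table, so the net values sum to zero; in the real data they would not, because players also move to and from clubs outside the three leagues.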
The data also factors in players who are in the reserve team at the club, but leave on a loan or permanent deal. This is why we see many teams with net negative transfer activity. We can also infer that part of this is because smaller clubs still need to bring in players as in any other window, but due to COVID-19 they do not have the resources to pay large transfer fees. Loan deals offer these smaller clubs a cost-effective way to still bring in talented young players.
The plot above shows the most active buying clubs, or the clubs who brought in the most players. Cadiz was recently promoted to the top division in Spain, and it looks like they brought in an entire new squad’s worth of players. Most of the activity among the top 15 most active buying clubs is in Italy and Spain, with only one English team in the top 15: Bolton Wanderers. Bolton in particular has been hit very hard by COVID-19, as they were in financial administration prior to the pandemic. Administration is where an outside party comes in and manages the club from a financial and administrative perspective. So it is pretty remarkable they are still around, and great that they are able to bring in players and hopefully turn things around.
The preceding plot shows the most active selling clubs measured by the number of players leaving, whether by loan or on a permanent basis. Most of these teams are considered bigger clubs, so it is possible they have been loaning out more players this year, as smaller clubs look to save financially while still bringing in players. Many of the clubs shown here have probably loaned out a number of players from their reserve teams, as these numbers are very high.
The above plot shows the number of deals per day across the date range February 1 to October 6. Deals appear outside of the window (which officially opens July 1) because clubs can reach pre-agreements: the player has agreed terms with the new club, both clubs have agreed a fee or exchange, and the deal officially goes through once the transfer window opens. With pre-agreements, the player finishes the season with their current team and then joins their new team once the window opens. As the plot shows, once the window opened in early July, there was an uptick in the number of deals. 201 deals were completed on October 5 (only measured from the data collected), which is known as Deadline Day, the last day of the transfer window. Clubs work hard to get late deals over the line, and smaller clubs often anticipate they can get a reduced transfer fee or a loan if they wait until the last minute. Selling clubs will sometimes let a player go at the last minute in order to offload their wages and / or receive a fee for the player, depending on the type of deal. If a player’s contract is in its final year, they are free to negotiate with other teams from the mid-season window. So there is an incentive to sell a player who is in the final year of their contract, even for a reduced fee. The alternative would be to let them leave on a free transfer at the end of the season.
The top 15 busiest days of the transfer window, measured by the number of deals completed, are shown above. From this we can see with the exception of Deadline Day, most weeks during the window are fairly consistent in terms of activity in the market. In the third week of July (7-20 and 7-21), there could have been a bit more activity because most countries had completed their seasons and had to get moving to prepare for an early start to the new season in early September, but there are likely many other factors contributing to this. It is difficult to measure why some days are busier than others.
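The deals-per-day and busiest-days summaries boil down to counting confirmed deals by date and sorting. A minimal base R sketch, using a handful of hypothetical confirmation dates rather than the scraped records:

```r
# Hypothetical confirmed-deal dates (illustrative only).
deals <- data.frame(
  player    = paste0("player_", 1:8),
  confirmed = as.Date(c("2020-07-20", "2020-07-20", "2020-07-21",
                        "2020-09-04", "2020-10-05", "2020-10-05",
                        "2020-10-05", "2020-06-15"))
)

# Count deals per calendar day.
counts <- table(deals$confirmed)
deals_per_day <- data.frame(
  confirmed = as.Date(names(counts)),
  n_deals   = as.integer(counts)
)

# Flag days before the window officially opened (July 1, as stated in the text).
deals_per_day$before_window <- deals_per_day$confirmed < as.Date("2020-07-01")

# Top 15 busiest days (fewer rows here because the toy data is small).
busiest <- head(deals_per_day[order(-deals_per_day$n_deals), ], 15)

print(busiest)
```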
The last summary plot of the deals data shows the top 15 deals during the transfer window measured by their transfer fees. It is clear COVID-19 didn’t stop some clubs from spending a lot of money to strengthen their teams. Kai Havertz’s move to Chelsea went through for a fee of ~$95 million, a crazy sum of money in general but especially during a difficult financial time. This deal became the main ‘saga’ of the window this summer, and a table below summarizes the move from week to week using Tweets from Fabrizio Romano.
Important note: it is not uncommon for deals to go through with undisclosed transfer fees. One possible explanation is that publicly-traded clubs have to disclose material investments, while clubs incorporated as private entities may not be required to disclose these fees, especially if the money comes from ownership sources. So there may be some deals in the data with undisclosed transfer fees that would have been in the top 15 most expensive deals, but from following the transfer window as a fan, this table seems to account for all of the mega deals.
The table below shows Fabrizio Romano’s Tweets about Kai Havertz’s potential and eventual move to Chelsea. Columns 3 and 4 show the Tweet timestamp followed by the confirmed date of the transfer respectively. The first footnote links to Fabrizio Romano’s Twitter profile, while the second links to the official Chelsea account’s Tweet announcing the signing of Kai Havertz.
Kai Havertz Transfer Saga: Fabrizio Romano's key tweets

| user¹ | text | datetime | confirmed² |
|---|---|---|---|
| Fabrizio Romano | Kai Havertz is now officially a new Chelsea player until June 2025 - the saga is over! 🔵🚨 #CFC #Chelsea #HiKai Sin… https://t.co/SgaUHHHwx5 | 2020-09-04 19:12:05 | 2020-09-04 |
| Fabrizio Romano | Kai Havertz is coming to Chelsea... big announcement already prepared by the club. Contract signed until June 2025.… https://t.co/C5ZDl14fc5 | 2020-09-04 09:55:54 | 2020-09-04 |
| Fabrizio Romano | The agreement between Bayer Leverkusen and Chelsea for Kai Havertz has been signed 9 days ago. Chelsea are already… https://t.co/ClFBmEeJEj | 2020-09-01 22:44:23 | 2020-09-04 |
| Fabrizio Romano | ...because Kai Havertz is a Chelsea player by one week! Just a matter of time to prepare the announcement, have med… https://t.co/1X435bbNaE | 2020-08-31 10:56:31 | 2020-09-04 |
| Fabrizio Romano | Just a matter of time... then Kai Havertz will join Chelsea. The agreement has been reached one week ago - it’s all… https://t.co/gNzJUHkgJR | 2020-08-30 16:34:11 | 2020-09-04 |
| Fabrizio Romano | Bayer Leverkusen are completing paperworks for Kai Havertz to Chelsea and are now pushing to sign Patrik Schick fro… https://t.co/r7x1Zl5PKx | 2020-08-24 19:47:04 | 2020-09-04 |
| Fabrizio Romano | Confirmed. Chelsea and Bayer Leverkusen to sign the agreement for Kai Havertz for €100M add ons included (80-10-10)… https://t.co/5VQqSbXsZh | 2020-08-24 13:59:12 | 2020-09-04 |
| Fabrizio Romano | Chelsea and Bayer Leverkusen have been in talks also today to find an agreement for Kai Havertz. New meeting soon.… https://t.co/wOaZTLMo6Q | 2020-08-19 20:53:41 | 2020-09-04 |
| Fabrizio Romano | Chelsea and Bayer Leverkusen still in talks for Kai Havertz. The player is pushing. The two clubs had a contact als… https://t.co/gd2DBuWE8m | 2020-08-19 03:20:54 | 2020-09-04 |
| Fabrizio Romano | Official talks started between Chelsea and Bayer Leverkusen for Havertz. Chelsea want to complete the deal soon but… https://t.co/c7JN7508QV | 2020-07-27 09:57:00 | 2020-09-04 |
| Fabrizio Romano | Havertz is still pushing to leave Bayer Leverkusen and he’d like to join Chelsea, confirmed. During this week Chels… https://t.co/2sIqXXfhHy | 2020-07-20 08:09:23 | 2020-09-04 |
| Fabrizio Romano | Havertz will consider talks about personal terms (until 2025) if Chelsea will make an official bid to Leverkusen so… https://t.co/wvOzpFiaYO | 2020-07-08 23:02:42 | 2020-09-04 |
| Fabrizio Romano | Kai Havertz agents told to Bayer Leverkusen he wants to leave the club if an “important bid” will arrive on next we… https://t.co/5tpzwK1le0 | 2020-07-08 23:01:51 | 2020-09-04 |
| Fabrizio Romano | Lampard: “There’s no bid by Chelsea for Bayer Leverkusen player Kai Havertz. He’s obviously a top player but we’re… https://t.co/V1cxS9CtRe | 2020-06-24 12:08:29 | 2020-09-04 |

¹ https://twitter.com/FabrizioRomano
² https://twitter.com/ChelseaFC/status/1301962256297664513
Interactive plot of Transfer Timeline
Below is the kind of chart I wanted to pair with a table like the one above: a plot of the key Tweets, with the y-axis showing the number of days until the deal is announced by the club and the x-axis showing the date. Each point in the plot represents a Tweet, and the tooltip that appears on hover shows the key phrase of the Tweet. The colors in the legend and of each individual point represent which journalist Tweeted the update on the potential deal.
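The y-axis value for each point is simply the confirmation date minus the Tweet date. A minimal base R sketch of that calculation, using three rows from the Kai Havertz table above and assuming the Tweet timestamps are in UTC (the time zone is not stated in the data shown):

```r
# Three rows taken from the Kai Havertz table; time zone is an assumption.
tweets <- data.frame(
  user      = "Fabrizio Romano",
  datetime  = as.POSIXct(c("2020-06-24 12:08:29",
                           "2020-08-24 13:59:12",
                           "2020-09-04 19:12:05"), tz = "UTC"),
  confirmed = as.Date("2020-09-04")
)

# Whole days between the Tweet and the club's official confirmation.
tweets$days_until_confirmed <- as.integer(
  tweets$confirmed - as.Date(tweets$datetime, tz = "UTC")
)

print(tweets[, c("datetime", "confirmed", "days_until_confirmed")])
```

Plotting `days_until_confirmed` against `datetime` gives one point per Tweet, which is the shape of the timeline chart described above.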
Confirmation: Everton’s Tweet announcing the deal.
My goal with the descriptive data analysis was to provide basic statistics on the selected sports journalists, such as the number of retweets and the number of times they posted Tweets during the date range for this project. In terms of deals, my goal was to provide summary data on transfer activity to answer questions such as: 1) Which clubs were more active in this transfer window?, 2) Did clubs still spend high sums on transfer fees for new players?, and 3) How did daily transfer activity fluctuate across the transfer window?
With comparative analysis my goal was to compare the datetimes of Tweets with the date of confirmation for a given deal. This is interesting as a soccer fan but also a sports fan in general, because this project could probably be reproduced or extended fairly easily to other sports. Additionally, it could be extended across other domains from a data journalism perspective to answer questions such as: which journalists can share news ahead of it passing through media outlets?
It is important to consider that some Twitter accounts may closely follow users like Fabrizio Romano and repeat their Tweets in their own words to make it look as though they’ve also got inside information and are reliable. In reality, journalists like Fabrizio Romano are the ones doing the work and are actually well-connected.
From a natural language processing perspective, I learned this is a difficult problem to solve. Unfortunately I had to manually select some Tweets to produce the example plot of Allan’s transfer to Everton so I could at a minimum show what I was attempting to do. Manually selecting key Tweets is obviously not scalable across thousands of transfers and even more corresponding Tweets from many more users. Automating it is hard too: because each journalist or user Tweets with their own style of phrasing, it becomes very difficult to homogenize that process and extract key phrases at scale. For example, Fabrizio Romano typically confirms a deal with the phrase ‘Here we go!’, but filtering on that phrase can’t scale because very different language could be used to confirm a deal, like ‘It is nearly done’. This ambiguity makes filtering across thousands of Tweets very challenging.
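A small base R sketch makes the limitation concrete. The Tweet texts below are invented for illustration; the only pattern used is the ‘Here we go’ phrase mentioned above.

```r
# Hypothetical Tweet texts (not from the collected data).
tweets <- c(
  "Here we go! Deal confirmed between the two clubs.",
  "It is nearly done, just paperwork left.",
  "Official talks started between the two clubs.",
  "HERE WE GO, contract signed until 2025."
)

# A single fixed key phrase, matched case-insensitively.
confirm_pattern <- "here we go"
is_confirmation <- grepl(confirm_pattern, tweets, ignore.case = TRUE)

print(data.frame(tweet = tweets, is_confirmation = is_confirmation))
```

The second Tweet also signals that a deal is close, but a single fixed phrase misses it. Adding more patterns helps only until another journalist phrases it differently.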
From a web scraping and data collection perspective, I saw how three very similar webpages can require quite different scraping methods. This was a difficult problem and ended up taking me much longer than I initially planned for. This process showed me how important it is to continue learning R and to become much better at writing code.
If I can figure out how to filter Tweets with key phrases in a way that scales across thousands of Tweets from journalists of different nationalities and backgrounds, I’d like to give the Allan Transfer Timeline plot a drop-down menu where I can select any player from the list of completed deals. Each selection would show an interactive plot of that player’s transfer timeline, along with the journalists closely covering the deal and their relevant Tweets.
It would also be fun and interesting to take a player’s transfer and the fee paid, or the sum of wages paid to the player, and compare it to their performances across a season at their new team. This would answer the question of which deals were the most valuable for buying clubs. As a simple example, if a club paid 20 million USD for a new striker and he plays 40 games in a season, it comes out to 500,000 USD per game. The same idea could be extended to other statistics of the player beyond just the number of games played. This could be a really insightful analysis for transfer evaluation and even for the player valuations used to determine fair transfer fees.
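The per-game arithmetic from the example above, as a short base R sketch. The goals figure is a hypothetical extra, added only to show how the same ratio extends to another statistic.

```r
# Figures from the example in the text.
fee_usd      <- 20e6   # transfer fee: 20 million USD
games_played <- 40

cost_per_game <- fee_usd / games_played

# Hypothetical additional statistic, for illustration only.
goals         <- 16
cost_per_goal <- fee_usd / goals

cat("Cost per game (USD):", format(cost_per_game, big.mark = ","), "\n")
cat("Cost per goal (USD):", format(cost_per_goal, big.mark = ","), "\n")
```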
Some useful extras:
'%notin%' <- Negate('%in%')
At a few separate points in my project I wanted an operator for ‘not in’ when checking values against a vector of strings. I came across the code snippet above and it solved the problem: use the Negate() function on the %in% operator to create your own %notin% operator.
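A short usage sketch of the operator, with placeholder names standing in for real values:

```r
# Define the operator (same idea as the snippet above, written with backticks).
`%notin%` <- Negate(`%in%`)

# Placeholder values for illustration.
journalists <- c("Fabrizio Romano", "Journalist B", "Journalist C")
excluded    <- c("Journalist B")

# Keep only the journalists that are NOT in the excluded vector.
kept <- journalists[journalists %notin% excluded]

print(kept)
```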