For this final project I decided to work with a topic that I have a passion for. Similar to assignment 6, I am working with soccer data. Previously, I focused on the Premier League (England), but for this project I thought I would learn more about soccer from the land of Oktoberfest. I chose to analyze the top four teams of the Bundesliga and what separated the first-place team from the second-, third-, and fourth-placed teams. I was also curious to see who the standout players were during the 2019/2020 season.
The 2019/2020 season of the Bundesliga resulted in Bayern München being crowned champion. This outcome was no surprise to people who follow the Bundesliga. Bayern has been crowned champion of Germany 29 times and has won the last 8 Bundesliga titles. They are a historic club that expects excellence. Lately, Dortmund, RB Leipzig, and Gladbach have been nipping at their heels but can’t quite catch the German giants. Below is a breakdown of the point totals for the 2019/2020 season.
Team | Points |
---|---|
Bayern | 82 |
Dortmund | 69 |
RB Leipzig | 66 |
Gladbach | 65 |
I used fbref.com to find team and player stats for each of the top four teams in the Bundesliga. I downloaded a csv file for each team and then merged the 4 datasets into a single file. I added a column to identify which team each player was on and deleted several variables that wouldn’t play a role in my analysis. I then uploaded the csv file to my OneDrive in order to host the data. Below is the link to the data I used for this project.
bundesliga <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/weberb2_xavier_edu/EYnAjfJsoAlGtDyUGyLNTLAB0A-U6rsllfZ_4NVwcb794g?download=1")
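The merge described above can be sketched roughly as follows. This is a minimal illustration in base R, not the script I actually ran: the two small data frames stand in for the per-team downloads (their goal values come from the scorer table later in this report), and the `Matches` column is a hypothetical leftover variable used only to show how a column is dropped.

```r
# Stand-ins for two of the four per-team files from fbref.com.
# `Matches` is a hypothetical column that is not needed for the analysis.
team_files <- list(
  Bayern = data.frame(
    Player = c("Robert Lewandowski", "Serge Gnabry"),
    Gls = c(34, 12),
    Matches = "Matches"
  ),
  Dortmund = data.frame(
    Player = c("Jadon Sancho", "Erling Haaland"),
    Gls = c(17, 13),
    Matches = "Matches"
  )
)

# Add a Team column to each data frame, using the list names as team labels.
tagged <- Map(function(df, team) {
  df$Team <- team
  df
}, team_files, names(team_files))

# Stack the per-team data frames into one and reset the row names.
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL

# Drop variables that will not play a role in the analysis.
combined <- combined[, setdiff(names(combined), "Matches")]
print(combined)
```

With the real downloads, each list element would come from reading one team's csv file instead of being typed in by hand.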
Variables | Explanation |
---|---|
Player | Name of the Player |
Team | Team the player plays for |
Nation | Nation the player is from |
Pos | Position of the player |
Age | Age of player |
MP | Matches played |
Starts | Games started by the player |
Min | Minutes played |
90s | Minutes played / 90 (Full games played) |
Gls | Goals scored |
Ast | Assists registered |
G-PK | Non-Penalty goals |
PK | Penalty kicks scored |
PKatt | Penalty kicks attempted |
CrdY | Yellow cards earned |
CrdR | Red cards earned |
This table describes some of the most basic soccer statistics recorded for each player on the four teams. These statistics show which teams were exceptionally good at attacking and which were more disciplined.
This section will include visualizations and tables that compare teams to one another, as well as highlight individual players.
Given this visual, we can see that Bayern scored the most goals out of the top four teams in Germany. We can also see that the order in which the teams finished matches the number of goals they scored: Bayern (1st) scored the most and Gladbach (4th) scored the fewest. This suggests a positive relationship between goals scored and final position, although four teams are too few to establish a correlation.
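A quick way to check the ordering is to rank the four teams by points and by goals and compare the two rankings. The sketch below uses the point totals listed earlier and the goals-for totals reported later in this report; the data frame and column names are my own choices for illustration.

```r
# Point totals and goals-for totals for the top four teams, as reported in this report.
standings <- data.frame(
  Team = c("Bayern", "Dortmund", "RB Leipzig", "Gladbach"),
  Pts = c(82, 69, 66, 65),
  GF = c(97, 84, 81, 64)
)

# Rank 1 = highest value, so negate before ranking.
standings$PtsRank <- rank(-standings$Pts)
standings$GFRank <- rank(-standings$GF)
print(standings)

# TRUE when the finishing order and the goals-scored order agree.
same_order <- all(standings$PtsRank == standings$GFRank)
print(same_order)
```

Agreement between two rankings of only four teams is weak evidence on its own, so this is best read as a sanity check on the chart rather than a statistical result.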
Soccer can be played with a variety of styles and tactics. Assists are one way of measuring a team’s style of play. Teams with a lot of assists tend to share the ball and play a possession-oriented brand of soccer. Given these results, we can see that both Bayern and Dortmund have considerably more assists than Gladbach and RB Leipzig. This suggests that Bayern and Dortmund favor a possession-based game, while a larger share of Gladbach’s and RB Leipzig’s goals are unassisted. Their style of play could be considered more direct than possession-oriented.
Team | Cards Earned |
---|---|
Bayern | 57 |
Dortmund | 50 |
Gladbach | 80 |
RB Leipzig | 56 |
One way to look at the discipline of a team is to aggregate its red and yellow cards. Yellow cards are given for an accumulation of fouls or for minor misconduct, while red cards are given for serious misconduct or for a second yellow card. Typically, winning teams are disciplined teams. In the table above, Bayern’s total of 57 is close to those of Dortmund (50) and RB Leipzig (56). It makes sense that the top four teams would have similar card totals, since they are the four best teams in the league. The one outlier is Gladbach, who earned considerably more cards (80).
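To put a number on how far Gladbach sits from the other three teams, the card totals from the table can be compared directly. This is a small sketch using only the four totals above; the variable names are illustrative.

```r
# Total cards (yellow + red) per team, taken from the table above.
cards <- data.frame(
  Team = c("Bayern", "Dortmund", "Gladbach", "RB Leipzig"),
  Cards = c(57, 50, 80, 56)
)

# Average of the three teams other than Gladbach.
others_mean <- mean(cards$Cards[cards$Team != "Gladbach"])

# How many more cards Gladbach earned than that average.
gladbach_gap <- cards$Cards[cards$Team == "Gladbach"] - others_mean

print(round(others_mean, 1))
print(round(gladbach_gap, 1))
```

In the full player data, the same totals would come from summing `CrdY` and `CrdR` for every player and grouping by `Team`.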
This scatter plot is meant to help identify if there is a correlation between age and minutes played. There doesn’t seem to be much of a correlation between these two variables. The only thing we can conclude from this plot is that the youngest players tend to not get much playing time, if any.
Player | Goals | Team |
---|---|---|
Robert Lewandowski | 34 | Bayern |
Timo Werner | 28 | RB Leipzig |
Jadon Sancho | 17 | Dortmund |
Erling Haaland | 13 | Dortmund |
Serge Gnabry | 12 | Bayern |
Marco Reus | 11 | Dortmund |
Alassane Pléa | 10 | Gladbach |
Marcus Thuram | 10 | Gladbach |
Patrik Schick | 10 | RB Leipzig |
Lars Stindl | 9 | Gladbach |
Marcel Sabitzer | 9 | RB Leipzig |
The first visual provides a quick look at the top four goal scorers during the 2019/2020 season. The colors in the graph indicate the team each player plays for. This is helpful to see which attacking players were the most effective. Robert Lewandowski helped Bayern win this season by scoring 34 goals in total.
The table lists the top goal scorers among the four teams (eleven players, since two are tied on 9 goals). This gives a broader view of which attackers were able to find the net the most.
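Ties are the reason a "top N" table can end up with more than N rows. The sketch below shows one way to keep ties at the cutoff, using the scorer table above as input. The helper `top_n_with_ties` is a hypothetical function written for this illustration, not something from a package.

```r
# Scorer table from this report.
scorers <- data.frame(
  Player = c("Robert Lewandowski", "Timo Werner", "Jadon Sancho", "Erling Haaland",
             "Serge Gnabry", "Marco Reus", "Alassane Pléa", "Marcus Thuram",
             "Patrik Schick", "Lars Stindl", "Marcel Sabitzer"),
  Gls = c(34, 28, 17, 13, 12, 11, 10, 10, 10, 9, 9),
  Team = c("Bayern", "RB Leipzig", "Dortmund", "Dortmund", "Bayern", "Dortmund",
           "Gladbach", "Gladbach", "RB Leipzig", "Gladbach", "RB Leipzig")
)

# Hypothetical helper: keep every row whose value is at least the n-th largest value,
# so players tied with the n-th row are all included.
top_n_with_ties <- function(df, n, column) {
  values <- df[[column]]
  cutoff <- sort(values, decreasing = TRUE)[min(n, length(values))]
  kept <- df[values >= cutoff, ]
  kept[order(-kept[[column]]), ]
}

top7 <- top_n_with_ties(scorers, 7, "Gls")
print(nrow(top7))
```

Asking for the top seven here returns nine players, because three players share the seventh-highest total of 10 goals.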
Player | Assists | Team |
---|---|---|
Thomas Müller | 21 | Bayern |
Jadon Sancho | 16 | Dortmund |
Christopher Nkunku | 13 | RB Leipzig |
Thorgan Hazard | 13 | Dortmund |
Achraf Hakimi | 10 | Dortmund |
Alassane Pléa | 10 | Gladbach |
Serge Gnabry | 10 | Bayern |
Marcus Thuram | 8 | Gladbach |
Timo Werner | 8 | RB Leipzig |
Joshua Kimmich | 7 | Bayern |
Julian Brandt | 7 | Dortmund |
Marcel Sabitzer | 7 | RB Leipzig |
Similar to the previous visual, this bar chart provides a quick look at the players who had the most assists during the season. This shows who the top play-makers of the Bundesliga were for the 2019/2020 season. Thomas Müller, Bayern’s attacking midfielder, found himself involved in many goals and helped lead them to another Bundesliga title.
The table lists the top assist providers among the four teams. This gives a broader view of which play-makers were able to help their teammates score throughout the season.
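Several players appear in both the scorer table and the assist table. A rough sketch of combining the two, assuming only the rows shown in this report, is to merge on player and team and add goals and assists together. Players who appear in just one table are dropped by the merge, so this is not a complete goal-contribution ranking.

```r
# Scorer and assist tables as shown in this report.
scorers <- data.frame(
  Player = c("Robert Lewandowski", "Timo Werner", "Jadon Sancho", "Erling Haaland",
             "Serge Gnabry", "Marco Reus", "Alassane Pléa", "Marcus Thuram",
             "Patrik Schick", "Lars Stindl", "Marcel Sabitzer"),
  Gls = c(34, 28, 17, 13, 12, 11, 10, 10, 10, 9, 9),
  Team = c("Bayern", "RB Leipzig", "Dortmund", "Dortmund", "Bayern", "Dortmund",
           "Gladbach", "Gladbach", "RB Leipzig", "Gladbach", "RB Leipzig")
)
assists <- data.frame(
  Player = c("Thomas Müller", "Jadon Sancho", "Christopher Nkunku", "Thorgan Hazard",
             "Achraf Hakimi", "Alassane Pléa", "Serge Gnabry", "Marcus Thuram",
             "Timo Werner", "Joshua Kimmich", "Julian Brandt", "Marcel Sabitzer"),
  Ast = c(21, 16, 13, 13, 10, 10, 10, 8, 8, 7, 7, 7),
  Team = c("Bayern", "Dortmund", "RB Leipzig", "Dortmund", "Dortmund", "Gladbach",
           "Bayern", "Gladbach", "RB Leipzig", "Bayern", "Dortmund", "RB Leipzig")
)

# Inner join: keeps only players present in both tables.
both <- merge(scorers, assists, by = c("Player", "Team"))

# Goals plus assists, sorted from highest to lowest.
both$Contributions <- both$Gls + both$Ast
both <- both[order(-both$Contributions), ]
print(both)
```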
The secondary data that I am using to complement my analysis is the overall league data for each team within the Bundesliga. I scraped an HTML table from a website and did some data cleaning within the script I used to scrape the table. I then uploaded the scraped table as a csv to my OneDrive to host the data. The data within this table includes variables such as team, matches played, goals for, goals against, goal differential, points earned, and several more. The following visualizations should serve to analyze the Bundesliga further.
This scatter plot shows the relationship between goals against and points earned by each team. There is a clear negative correlation between the two variables: the fewer goals a team allows, the more points it tends to earn. In this case, Bayern let in the fewest goals, earned the most points, and won the Bundesliga title.
Team | Goals For |
---|---|
Bayern | 97 |
Dortmund | 84 |
Gladbach | 64 |
RB Leipzig | 81 |
Squad | Goals Against |
---|---|
Bayern Munich | 32 |
Dortmund | 41 |
M’Gladbach | 40 |
RB Leipzig | 37 |
These two tables help show how productive the top four teams were offensively, and how stable their defense was throughout the season. These two measures help show how Bayern won the Bundesliga title. They scored the most goals and allowed the fewest.
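The two tables can be combined to get each team's goal difference, and the goals-against column can be set against the point totals listed at the start of this report. This is a minimal sketch limited to the top four teams, so the correlation it prints is only indicative and is not the value for the whole league.

```r
# Goals for, goals against, and points for the top four teams, from this report.
top4 <- data.frame(
  Team = c("Bayern", "Dortmund", "Gladbach", "RB Leipzig"),
  Pts = c(82, 69, 65, 66),
  GF = c(97, 84, 64, 81),
  GA = c(32, 41, 40, 37)
)

# Goal difference = goals for minus goals against.
top4$GD <- top4$GF - top4$GA
print(top4)

# Pearson correlation between goals against and points (negative here, about -0.84).
r_ga_pts <- cor(top4$GA, top4$Pts)
print(round(r_ga_pts, 2))
```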
I plan on creating a linear regression model that will show how several variables affect the points earned by each team. I will look at the following variables: wins, draws, goals for, and goal differential.
##
## Call:
## lm(formula = Pts ~ W + D + GF + GD, data = league_stats)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -9.622e-15 -1.371e-15  3.940e-17  4.702e-16  1.538e-14
##
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)
## (Intercept) -2.010e-14  5.336e-14 -3.770e-01    0.713
## W            3.000e+00  1.763e-15  1.702e+15   <2e-16 ***
## D            1.000e+00  1.122e-15  8.912e+14   <2e-16 ***
## GF           9.571e-17  4.672e-16  2.050e-01    0.841
## GD           3.404e-17  5.528e-16  6.200e-02    0.952
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.185e-15 on 13 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.125e+31 on 4 and 13 DF, p-value: < 2.2e-16
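Given this linear model, there are several things we can conclude. First, the R-squared value of 1 tells us that the regression fits the data perfectly. This is expected rather than informative: points are awarded as three per win and one per draw, so Pts = 3W + D holds exactly, and the model recovers that rule in its coefficients (3 for wins, 1 for draws). Second, wins and draws are significant, with p-values below .05, while goals for and goal differential are not, with p-values above .05. Because wins and draws already determine points completely, goals for and goal differential have nothing left to explain, and their coefficients are zero to within rounding error. This does not mean goals are unrelated to points, only that they add nothing once wins and draws are in the model.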
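One way to read the R-squared of 1 is to check directly that points are an exact function of wins and draws. The sketch below does this for the top four teams only. The win and draw counts are an assumption on my part: they are typed in from the final 2019/2020 standings as I recall them rather than read from `league_stats`, and they should be verified against that data. The point totals are the ones listed at the start of this report.

```r
# Wins and draws are assumed values (not taken from league_stats); points are from this report.
top4 <- data.frame(
  Team = c("Bayern", "Dortmund", "RB Leipzig", "Gladbach"),
  W = c(26, 21, 18, 20),
  D = c(4, 6, 12, 5),
  Pts = c(82, 69, 66, 65)
)

# Three points per win, one per draw: does the rule reproduce the point totals exactly?
identity_holds <- all(3 * top4$W + top4$D == top4$Pts)
print(identity_holds)

# Fitting points on wins and draws recovers the scoring rule in the coefficients.
fit <- lm(Pts ~ W + D, data = top4)
print(round(coef(fit), 6))
```

Since the response is an exact linear combination of two predictors, any extra predictor such as goals for or goal differential can only receive a coefficient of essentially zero. A more informative model would leave wins and draws out and ask how well goals for and goals against predict points on their own.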
This scatter plot is meant to individually visualize the correlation between wins and points earned. Clearly there is a positive relationship between the two variables.
This project helped show how certain variables can influence a team’s position at the end of a season. I also got to know more about the league as a whole and who the top performers were.