Topic of Interest

For this final project I decided to work with a topic that I have a passion for. As in assignment 6, I am working with soccer data. Previously, I focused on the Premier League (England), but for this project I thought I would learn more about soccer from the land of Oktoberfest. I chose to analyze the top four teams of the Bundesliga and what separated the first-place team from the second-, third-, and fourth-placed teams. I was also curious to see who the standout players were during the 2019/2020 season.

Background

The 2019/2020 Bundesliga season ended with Bayern München being crowned champions. This outcome was no surprise to people who follow the Bundesliga: Bayern has won the Bundesliga 29 times, including the last 8 titles. They are a historic club that expects excellence. As of late, Dortmund, RB Leipzig, and Gladbach have been nipping at their heels but can’t quite catch the German giants. Below is a breakdown of the top four teams' point totals for the 2019/2020 season.

Team Points
Bayern 82
Dortmund 69
RB Leipzig 66
Gladbach 65

Dataset

I used fbref.com to find team and player stats for each of the top four teams in the Bundesliga. I downloaded a CSV file for each team and then merged the four datasets into a single file. I added a column to identify which team each player was on and deleted several variables that wouldn’t play a role in my analysis. I then uploaded the CSV file to my OneDrive to host the data. Below is the link to the data I used for this project.

library(readr)  # provides read_csv()
bundesliga <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/weberb2_xavier_edu/EYnAjfJsoAlGtDyUGyLNTLAB0A-U6rsllfZ_4NVwcb794g?download=1")
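
The merge step described above can be sketched roughly as follows. This is a minimal illustration in base R, not the script actually used for the project: the small data frames stand in for the four downloaded CSV files, the names `team_files`, `tagged`, and `merged` are made up for the example, and the goal values are copied from the goal scorer table later in this report.

```r
# Stand-ins for the four per-team CSV files (hypothetical object names).
# Goal values are copied from the goal scorer table in this report.
team_files <- list(
  "Bayern"     = data.frame(Player = "Serge Gnabry", Gls = 12),
  "Dortmund"   = data.frame(Player = "Jadon Sancho", Gls = 17),
  "RB Leipzig" = data.frame(Player = c("Timo Werner", "Marcel Sabitzer"),
                            Gls = c(28, 9)),
  "Gladbach"   = data.frame(Player = "Marcus Thuram", Gls = 10)
)

# Add a Team column to each table so every row records its club.
tagged <- Map(function(df, team) {
  df$Team <- team
  df
}, team_files, names(team_files))

# Stack the four tables into one and reset the row names.
merged <- do.call(rbind, tagged)
rownames(merged) <- NULL

# Keep only the columns needed for the analysis.
merged <- merged[, c("Player", "Team", "Gls")]
print(merged)
```

With real files, each `data.frame(...)` call would presumably be replaced by a `read.csv()` or `read_csv()` call on the downloaded file.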

Introduction to the dataset

Data Dictionary

Variables Explanation
Player Name of the Player
Team Team the player plays for
Nation Nation the player is from
Pos Position of the player
Age Age of player
MP Matches played
Starts Games started by the player
Min Minutes played
90s Minutes played divided by 90 (equivalent full games played)
Gls Goals scored
Ast Assists registered
G-PK Non-Penalty goals
PK Penalty kicks scored
PKatt Penalty kicks attempted
CrdY Yellow cards earned
CrdR Red cards earned

Datatable

Summary Statistics

This table provides some of the most basic soccer statistics for each of the four teams. These statistics show which teams were exceptionally good at attacking and which were more disciplined.
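
A per-team summary like this one can be built by grouping the player rows by team and summing each statistic. The sketch below uses base R's `aggregate()` and is only an assumption about how such a table might be produced. The five player rows are copied from the goal and assist tables in this report, so the totals cover just those players, not the full squads.

```r
# Player rows copied from the goal and assist tables in this report.
# This is a small subset, so the sums are NOT full team totals.
players <- data.frame(
  Player = c("Serge Gnabry", "Jadon Sancho", "Marcus Thuram",
             "Timo Werner", "Marcel Sabitzer"),
  Team   = c("Bayern", "Dortmund", "Gladbach", "RB Leipzig", "RB Leipzig"),
  Gls    = c(12, 17, 10, 28, 9),
  Ast    = c(10, 16, 8, 8, 7)
)

# Sum goals and assists within each team.
team_totals <- aggregate(cbind(Gls, Ast) ~ Team, data = players, FUN = sum)
print(team_totals)
```

Running the same call on the full dataset, with the card columns added inside `cbind()`, would give one row of totals per team.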

Analysis

This section includes visualizations and tables that compare the teams to one another and highlight individual players.

Visual 1 - Goals

This visual shows that Bayern scored the most goals of the top four teams in Germany. The order in which the teams finished also matches the number of goals they scored: Bayern (1st) scored the most and Gladbach (4th) scored the least. This suggests a positive relationship between goals scored and final position, although four teams are too few to establish a correlation.

Visual 2 - Assists

Soccer can be played with a variety of styles and tactics, and assists are one way to gauge a team’s style of play. Teams with a lot of assists tend to share the ball and play a possession-oriented brand of soccer. These results show that Bayern and Dortmund have considerably more assists than Gladbach and RB Leipzig. This suggests that Bayern and Dortmund favor a possession-based game, while Gladbach and RB Leipzig score a larger share of their goals unassisted, which points to a more direct style of play.

Visual 3 - Discipline

Team Cards Earned
Bayern 57
Dortmund 50
Gladbach 80
RB Leipzig 56

One way to look at the discipline of a team is to aggregate its red and yellow cards. Yellow cards are given for offenses such as persistent fouling or unsporting behavior, while red cards are given for serious foul play, violent conduct, or a second yellow card. Typically, winning teams are disciplined teams. In the table above, Bayern’s total sits in the middle of the group. It would make sense for the top four teams to have similar card totals, since they are the four best teams in the league. The one outlier is Gladbach, with 80 cards, at least 23 more than any of the other three teams.
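
To put the card totals side by side, a short calculation helps. The sketch below starts from the four totals in the table above. The split between yellow and red cards is not shown in this report, so the per-player step of adding `CrdY` and `CrdR` is left out. The vector name `cards` is made up for the example.

```r
# Card totals copied from the table above.
cards <- c(Bayern = 57, Dortmund = 50, Gladbach = 80, "RB Leipzig" = 56)

# Average across the four teams.
mean(cards)              # 60.75

# How far each team is from that average (negative = fewer cards).
cards - mean(cards)

# Team with the most cards.
names(which.max(cards))  # "Gladbach"
```

Gladbach is the only team above the four-team average, which is what makes it stand out in the table.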

Visual 4 - Age vs. Minutes Played

This scatter plot is meant to help identify whether there is a correlation between age and minutes played. There doesn’t seem to be much of a correlation between these two variables. The only thing we can conclude from this plot is that the youngest players tend to get little playing time, if any.

Visual 5 - Goal Scorers

Player Goals Team
Robert Lewandowski 34 Bayern
Timo Werner 28 RB Leipzig
Jadon Sancho 17 Dortmund
Erling Haaland 13 Dortmund
Serge Gnabry 12 Bayern
Marco Reus 11 Dortmund
Alassane Pléa 10 Gladbach
Marcus Thuram 10 Gladbach
Patrik Schick 10 RB Leipzig
Lars Stindl 9 Gladbach
Marcel Sabitzer 9 RB Leipzig

The first visual provides a quick look at the top four goal scorers during the 2019/2020 season. The colors in the graph indicate the team each player plays for. This is helpful for seeing which attacking players were the most effective. Robert Lewandowski helped Bayern win the title this season by scoring 34 goals.

The table lists the top ten goal scorers among the four teams; it has eleven rows because two players are tied at 9 goals. This provides a broader view of which attackers found the net most often.
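
A top-ten list that keeps ties can be produced by ranking the players and keeping every rank of ten or better. The sketch below is one possible way to do it in base R, not necessarily how the table was generated. The goal counts are copied from the table above, and `goal_rank` and `top_ten` are names made up for the example.

```r
# Goal counts copied from the goal scorer table above.
goals <- c(34, 28, 17, 13, 12, 11, 10, 10, 10, 9, 9)

# Rank from most to fewest goals; tied players share the best rank.
goal_rank <- rank(-goals, ties.method = "min")
print(goal_rank)

# Keep everyone ranked tenth or better, ties included.
top_ten <- goals[goal_rank <= 10]
length(top_ten)  # 11
```

The two players on 9 goals both receive rank 10, so both stay in the list.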

Visual 6 - Play-Makers

Player Assists Team
Thomas Müller 21 Bayern
Jadon Sancho 16 Dortmund
Christopher Nkunku 13 RB Leipzig
Thorgan Hazard 13 Dortmund
Achraf Hakimi 10 Dortmund
Alassane Pléa 10 Gladbach
Serge Gnabry 10 Bayern
Marcus Thuram 8 Gladbach
Timo Werner 8 RB Leipzig
Joshua Kimmich 7 Bayern
Julian Brandt 7 Dortmund
Marcel Sabitzer 7 RB Leipzig

Similar to the previous visual, this bar chart provides a quick look at the players who had the most assists during the season. This shows who the top play-makers of the Bundesliga were for the 2019/2020 season. Thomas Müller, Bayern’s attacking midfielder, was involved in many goals and helped lead the team to another Bundesliga title.

The table lists the top assist providers among the four teams; it has twelve rows because three players are tied at 7 assists. This provides a broader view of which play-makers helped their teammates score throughout the season.

Secondary Data

The secondary data that I am using to complement my analysis is the overall league data for each team in the Bundesliga. I scraped an HTML table from a website and did some data cleaning within the script I used to scrape the table. I then uploaded the scraped table as a CSV to my OneDrive to host the data. The table includes variables such as team, matches played, goals for, goals against, goal differential, points earned, and several more. The following visualizations analyze the Bundesliga further.

Goals Against & Points

This scatter plot shows the relationship between goals against and points earned by each team. There is a clear negative correlation between the two variables: teams that allow fewer goals tend to earn more points. In this case, Bayern let in the fewest goals, earned the most points, and won the Bundesliga title.
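
The strength of this relationship can be summarized with a correlation coefficient. The sketch below uses only the top four teams, with goals against and points copied from the tables in this report, so the result will likely differ from the value for the whole league. The data frame name `top_four` is made up for the example.

```r
# Goals against and points for the top four teams, copied from this report.
top_four <- data.frame(
  Team = c("Bayern", "Dortmund", "RB Leipzig", "Gladbach"),
  GA   = c(32, 41, 37, 40),
  Pts  = c(82, 69, 66, 65)
)

# Pearson correlation between goals against and points.
r <- cor(top_four$GA, top_four$Pts)
round(r, 2)  # about -0.84
```

A negative value matches the downward slope in the scatter plot, although four teams are far too few to draw firm conclusions.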

Comparison Visuals

I will use two different visuals to compare my primary data with my secondary data source.

Goals Scored - Primary Data

Team Goals For
Bayern 97
Dortmund 84
Gladbach 64
RB Leipzig 81

Goals Against - Secondary Data

Squad Goals Against
Bayern Munich 32
Dortmund 41
M’Gladbach 40
RB Leipzig 37

These two tables help show how productive the top four teams were offensively, and how stable their defense was throughout the season. These two measures help show how Bayern won the Bundesliga title. They scored the most goals and allowed the fewest.

Predictive Analysis

I created a linear regression model to show how several variables affect the points earned by each team. I looked at the following variables: wins, draws, goals for, and goal differential.

## 
## Call:
## lm(formula = Pts ~ W + D + GF + GD, data = league_stats)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -9.622e-15 -1.371e-15  3.940e-17  4.702e-16  1.538e-14 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -2.010e-14  5.336e-14 -3.770e-01    0.713    
## W            3.000e+00  1.763e-15  1.702e+15   <2e-16 ***
## D            1.000e+00  1.122e-15  8.912e+14   <2e-16 ***
## GF           9.571e-17  4.672e-16  2.050e-01    0.841    
## GD           3.404e-17  5.528e-16  6.200e-02    0.952    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.185e-15 on 13 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.125e+31 on 4 and 13 DF,  p-value: < 2.2e-16

Given this linear model, there are several things we can conclude. First, the R-squared value of 1 tells us that the regression predictions fit the data perfectly. This is expected rather than a discovery: league points are calculated as 3 points per win plus 1 point per draw, so the model simply recovers that scoring rule, as the coefficients of 3 for wins and 1 for draws show. Second, wins and draws are significant, with p-values below .05, while goals for and goal differential are not, with p-values above .05. Once wins and draws are in the model they determine points exactly, so goals for and goal differential have nothing left to explain. This does not mean that goals are unrelated to points.
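
The perfect fit can be reproduced with made-up numbers. The sketch below builds a toy league of 18 teams with randomly drawn wins, draws, goals for, and goal differential, then computes points with the 3-for-a-win, 1-for-a-draw rule. All values are simulated for illustration and are not the real `league_stats` data.

```r
set.seed(1)  # make the simulated league reproducible

# Toy league: every value here is randomly generated, not real data.
n_teams <- 18
toy <- data.frame(
  W  = sample(4:22, n_teams, replace = TRUE),
  D  = sample(3:12, n_teams, replace = TRUE),
  GF = sample(30:100, n_teams, replace = TRUE),
  GD = sample(-40:60, n_teams, replace = TRUE)
)

# Points follow the league scoring rule exactly.
toy$Pts <- 3 * toy$W + 1 * toy$D

# Same model formula as in the report.
fit <- lm(Pts ~ W + D + GF + GD, data = toy)

# Coefficients: 3 for W, 1 for D, and effectively 0 for the rest.
round(coef(fit), 6)

# Residuals are only floating-point noise.
max(abs(residuals(fit)))
```

Because points are an exact function of wins and draws, the fit is perfect whatever the goal columns contain. To study how goals relate to points, a model without `W` and `D` would be more informative.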

Wrap Up Visual

This scatter plot visualizes the relationship between wins and points earned on its own. There is a clear positive relationship between the two variables.

Conclusion

This project helped show how certain variables can influence a team’s position at the end of a season. I also got to know more about the league as a whole and who its top performers were.