My area of interest in this project is soccer. As it is World Cup season, I wanted to dive deeper into the analytics of soccer (football) and see which factors within a game have the most impact. I am using an English Premier League (EPL) dataset from Kaggle with data on matches played from 1993 to 2021 to look at how different variables can affect a game’s outcome or a team’s season. The dataset is available here: https://www.kaggle.com/datasets/irkaal/english-premier-league-results
| Variable Name | Variable Type | Description |
|---|---|---|
| Season | String | Which Season the match was played in |
| DateTime | UNIX time | The date and time when the game was played |
| HomeTeam | String | Club name for the home team during the match |
| AwayTeam | String | Club name for the away team during the match |
| FTHG | Numeric | Full Time Home Team Goals (After the match has finished) |
| FTAG | Numeric | Full Time Away Team Goals (After the match has finished) |
| FTR | String | Full Time Result (A = Away winning, H = Home winning, D = Draw) |
| HTHG | Numeric | Halftime Home Team Goals (After the match has finished 1st Half) |
| HTAG | Numeric | Halftime Away Team Goals (After the match has finished 1st Half) |
| HTR | String | Halftime Result (A = Away winning, H = Home winning, D = Draw) |
| Referee | String | Name of the head referee during the match |
| HS | Numeric | Home team shots during match |
| AS | Numeric | Away team shots during match |
| HST | Numeric | Home team shots on target during the match |
| AST | Numeric | Away team shots on target during the match |
| HC | Numeric | Home team corner kicks |
| AC | Numeric | Away team corner kicks |
| HF | Numeric | Home team fouls committed |
| AF | Numeric | Away team fouls committed |
| HY | Numeric | Home team yellow cards during the match |
| AY | Numeric | Away team yellow cards during the match |
| HR | Numeric | Home team red cards during the match |
| AR | Numeric | Away team red cards during the match |
| Weekday | String | States what day of week match was played |

| Home Goals | Away Goals | Shots | Shots on Target | Fouls | Bookings (Yellow and Red Cards) |
|---|---|---|---|---|---|
| 1.52 | 1.14 | 24.1 | 10.9 | 23.3 | 3.31 |

The summary statistics are per-match averages computed over all matches from the 1993-94 season through the 2021-22 season.
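Per-match averages like these could be computed with a single `summarise()` call. The following is only a sketch: the toy `matches` rows are made up for illustration, and it assumes that the combined columns (Shots, Shots on Target, Fouls, Bookings) add the home and away counts for each match, which the report does not state explicitly.

```r
library(dplyr)

# Toy rows standing in for the Kaggle data (values are made up for illustration).
matches <- data.frame(
  FTHG = c(2, 1, 0), FTAG = c(1, 1, 3),
  HS = c(14, 10, 9), AS = c(8, 12, 15),
  HST = c(6, 4, 3), AST = c(3, 5, 7),
  HF = c(11, 13, 10), AF = c(12, 9, 14),
  HY = c(1, 2, 3), AY = c(2, 1, 1),
  HR = c(0, 0, 1), AR = c(0, 0, 0)
)

# Assumption: combined statistics are home + away totals per match.
summary_stats <- matches %>%
  summarise(
    `Home Goals` = mean(FTHG, na.rm = TRUE),
    `Away Goals` = mean(FTAG, na.rm = TRUE),
    Shots = mean(HS + AS, na.rm = TRUE),
    `Shots on Target` = mean(HST + AST, na.rm = TRUE),
    Fouls = mean(HF + AF, na.rm = TRUE),
    `Bookings (Yellow and Red Cards)` = mean(HY + AY + HR + AR, na.rm = TRUE)
  )

print(summary_stats)
```

Older seasons may be missing some of these columns, so `na.rm = TRUE` keeps those matches from turning each average into `NA`.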
In this section of the project, I wanted to focus on referee analysis and how referees affect the game in ways that you may not initially realize.
My first visualization checks how referees from the past couple of seasons relate to the number of goals scored. Navy is home goals and tan is away goals.
This visualization shows that when J Brooks is the referee, away teams average about 1.75 goals while home teams average only about 0.67. When J Gillett is the referee, home teams average about 2.2 goals and away teams about 1.15.
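A grouped bar chart of average goals per referee could be built roughly as follows. This is a sketch rather than the exact code behind the plot: the referee names and scores in the toy data are placeholders, and the helper names `ref_goals`, `Side`, and `AvgGoals` are my own.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy data with placeholder referee names (not real results).
matches <- data.frame(
  Referee = c("Ref A", "Ref A", "Ref B", "Ref B"),
  FTHG = c(2, 0, 3, 1),
  FTAG = c(1, 2, 0, 2)
)

# Average home and away goals per referee, reshaped to one row per bar.
ref_goals <- matches %>%
  group_by(Referee) %>%
  summarise(Home = mean(FTHG), Away = mean(FTAG), .groups = "drop") %>%
  pivot_longer(c(Home, Away), names_to = "Side", values_to = "AvgGoals")

# Side-by-side bars, using the navy/tan colors described in the text.
p <- ggplot(ref_goals, aes(x = Referee, y = AvgGoals, fill = Side)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c(Home = "navy", Away = "tan")) +
  labs(y = "Average goals per match", fill = NULL)

print(p)
```

Referees with only a handful of matches can produce extreme averages, so it may be worth also reporting the match count per referee.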
In the next visualization, I wanted to look at how fouls are usually distributed among referees.
Lower boxes, such as those for K Friend and J Gillett, indicate referees who do not usually call as many fouls. Higher boxes, such as those for D Coote and C Pawson, indicate referees who call a lot of fouls, so players need to be cautious or they may receive a yellow or red card early.
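One way to draw this kind of boxplot is sketched below. It assumes the plotted quantity is total fouls per match (`HF + AF`), which is my guess at the definition; the referee names and foul counts are placeholders.

```r
library(dplyr)
library(ggplot2)

# Toy data with placeholder referee names (not real results).
matches <- data.frame(
  Referee = c("Ref A", "Ref A", "Ref A", "Ref B", "Ref B", "Ref B"),
  HF = c(8, 10, 9, 13, 12, 14),
  AF = c(9, 11, 10, 12, 15, 13)
)

# Assumption: fouls called in a match = home fouls + away fouls.
fouls <- matches %>%
  mutate(TotalFouls = HF + AF)

# One box per referee, ordered by median so strict referees are easy to spot.
p <- ggplot(fouls, aes(x = reorder(Referee, TotalFouls, FUN = median),
                       y = TotalFouls)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "Referee", y = "Fouls called per match")

print(p)
```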
In this section, I want to look specifically at in-game factors in EPL matches.
I am using data starting from the 2010-11 season. Most teams score at a higher rate at home than on the road. Teams like Everton and Man City have a good number of multi-goal games on the road, which suggests that, historically, they can compete with anyone and are dangerous away from home. A few teams have flatter, more horizontal lines, which indicates that they have not stayed in the EPL very long. The EPL is the highest division of English soccer, and every year the bottom three teams are relegated. A larger share of the counts indicates that a team has stayed in the EPL for a while.
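The home-versus-away comparison can also be summarized numerically. The sketch below is not the plot itself: it assumes `Season` is stored as text like `"2010-11"` (so a string comparison selects the later seasons), and the team names and scores are placeholders.

```r
library(dplyr)

# Toy data with placeholder team names (not real results).
matches <- data.frame(
  Season = c("2009-10", "2010-11", "2010-11", "2011-12"),
  HomeTeam = c("Team A", "Team A", "Team B", "Team B"),
  AwayTeam = c("Team B", "Team B", "Team A", "Team A"),
  FTHG = c(5, 2, 1, 0),
  FTAG = c(0, 1, 3, 2)
)

# Assumption: Season strings sort chronologically, e.g. "2009-10" < "2010-11".
recent <- matches %>%
  filter(Season >= "2010-11")

# Stack a home view and an away view so each row is one team in one match.
team_goals <- bind_rows(
  recent %>% transmute(Team = HomeTeam, Venue = "Home", Goals = FTHG),
  recent %>% transmute(Team = AwayTeam, Venue = "Away", Goals = FTAG)
) %>%
  group_by(Team, Venue) %>%
  summarise(Matches = n(), AvgGoals = mean(Goals), .groups = "drop")

print(team_goals)
```

The `Matches` column makes the relegation effect visible: teams with few seasons in the league have far fewer rows behind their averages.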
The next plot compares the density of the corner distribution with that of the shots-on-target distribution.
From a coaching standpoint, it is good to see that these two density plots have a similar shape. Shots on target often lead to corner kicks, and corners are a good way to get a free header or a shot on goal that could decide a match. Corners won from deflections off defenders’ errors can also end up “buying” the team more chances and shots on goal.
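Two overlaid densities can be drawn by reshaping the counts into long form first. This sketch assumes the plot uses match totals (`HC + AC` and `HST + AST`); the original may instead plot home and away values separately, and the toy numbers are made up.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy data (values are made up for illustration).
matches <- data.frame(
  HC = c(5, 7, 3, 6, 4), AC = c(4, 2, 6, 5, 3),
  HST = c(6, 4, 3, 7, 5), AST = c(3, 5, 7, 2, 4)
)

# Assumption: compare match totals of corners and shots on target.
long <- matches %>%
  transmute(Corners = HC + AC, `Shots on Target` = HST + AST) %>%
  pivot_longer(everything(), names_to = "Stat", values_to = "Count")

# Semi-transparent fills let both distributions stay visible where they overlap.
p <- ggplot(long, aes(x = Count, fill = Stat)) +
  geom_density(alpha = 0.5) +
  labs(x = "Count per match", y = "Density", fill = NULL)

print(p)
```

Similar density shapes do not by themselves show that shots cause corners; a scatter plot or correlation of the two per-match totals would test that link more directly.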
For the next plot, I wanted to bring the date into play by analyzing how the day of the week relates to whether a match ends in a home win, an away win, or a draw.
This plot shows whether home wins, away wins, or draws are more common on a particular day. It also indirectly shows which days of the week most games are played on. On Saturdays, home teams usually have a clear advantage based on this stat; on Sundays, however, it is a little more exciting for away teams, as their win count is much closer to that of home teams.
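Counting results by weekday is a short `count()` call; adding a within-day share helps separate "home teams win more on Saturdays" from "more games are played on Saturdays". The sketch below uses made-up rows, and the `Share` column is my addition rather than something shown in the original plot.

```r
library(dplyr)
library(ggplot2)

# Toy data (values are made up for illustration).
matches <- data.frame(
  Weekday = c("Saturday", "Saturday", "Saturday", "Sunday", "Sunday"),
  FTR = c("H", "H", "D", "A", "H")
)

# Matches per weekday and result, plus each result's share within its weekday.
results <- matches %>%
  count(Weekday, FTR, name = "Matches") %>%
  group_by(Weekday) %>%
  mutate(Share = Matches / sum(Matches)) %>%
  ungroup()

# Side-by-side bars: H = home win, A = away win, D = draw.
p <- ggplot(results, aes(x = Weekday, y = Matches, fill = FTR)) +
  geom_col(position = "dodge") +
  labs(y = "Number of matches", fill = "Full-time result")

print(p)
```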
My intent in scraping this data is to find out how each stadium is described. I have selected only a specific part of the description to scrape so as not to overload the website and to be courteous to its owners. I am going to scrape from a stadium analysis website and perform sentiment analysis on the Tottenham Hotspur Stadium.
After scraping the data on one of the newer renovated stadiums in the EPL, I can see that the sentiments for the stadium sit far to the right, with a good overall score. The most common word is “structure”, which aligns with both trust and positive sentiments. “Ground” and “land” also come up with positive and trust sentiments; since every EPL pitch is real grass and kept very smooth, this makes sense. From the earlier plot, I can also see that Tottenham Hotspur typically play well at home and score more than the average number of goals at home.
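The word-level sentiment step can be sketched with `tidytext`. Everything in the toy input is hypothetical: the `description` text is a made-up sentence, not the scraped page, and `lexicon` is a tiny hand-written stand-in for a real word-to-sentiment lexicon (the trust/positive categories suggest something like NRC, but that is an assumption).

```r
library(dplyr)
library(tidytext)

# Hypothetical text standing in for the scraped stadium description.
description <- data.frame(
  text = "The structure is impressive and the ground is superb. The structure stands on solid land."
)

# Hypothetical mini lexicon; a real analysis would load a published one.
lexicon <- data.frame(
  word = c("structure", "structure", "ground", "land", "superb"),
  sentiment = c("positive", "trust", "trust", "positive", "positive")
)

# Split into lowercase words, count each word, then attach its sentiments.
word_sentiments <- description %>%
  unnest_tokens(word, text) %>%
  count(word, name = "n") %>%
  inner_join(lexicon, by = "word") %>%
  arrange(desc(n))

print(word_sentiments)
```

A word can carry more than one sentiment, which is why "structure" appears once per category here. Some matches reflect word sense rather than opinion (for example "ground" meaning the stadium), so the highest-count words are worth reading in context.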