Introduction

"Most publicly available football (soccer) statistics are limited to aggregated data such as Goals, Shots, Fouls, Cards. When assessing performance or building predictive models, this simple aggregation, without any context, can be misleading. For example, a team that produced 10 shots on target from long range has a lower chance of scoring than a club that produced the same number of shots from inside the box. However, metrics derived from this simple count of shots will assess the two teams similarly.

A football game generates many more events, and it is important and interesting to take into account the context in which those events were generated. This dataset should keep sports analytics enthusiasts awake for long hours, as the number of questions that can be asked is huge.

Content

This dataset is the result of a very tiresome effort of web scraping and integrating different data sources. The central element is the text commentary. All the events were derived by reverse engineering the text commentary, using regex. Using this, I was able to derive 11 types of events, as well as the main player and secondary player involved in those events and many other statistics. In case I’ve missed extracting some useful information, you are gladly invited to do so and share your findings. The dataset provides a granular view of 9,074 games, totaling 941,009 events from the 5 biggest European football (soccer) leagues: England, Spain, Germany, Italy, France, from the 2011/2012 season to the 2016/2017 season as of 25.01.2017. There are games that have been played during these seasons for which I could not collect detailed data. Overall, over 90% of the games played during these seasons have event data.

The dataset is organized in 3 files:

events.csv contains event data about each game. Text commentary was scraped from bbc.com, espn.com and onefootball.com.

ginf.csv contains metadata and market odds about each game. Odds were collected from oddsportal.com.

dictionary.txt contains a dictionary with the textual description of each categorical variable coded with integers.

Past Research

I have used this data to:

create predictive models for football games in order to bet on football outcomes

make visualizations about upcoming games

build expected goals models and compare players

Inspiration

There are tons of interesting questions a sports enthusiast can answer with this dataset. For example:

What is the value of a shot? Or what is the probability of a shot being a goal given its location, shooter, league, assist method, gamestate, number of players on the pitch, time? These are known as expected goals (xG) models.

When are teams more likely to score?

Which teams are the best or sloppiest at holding the lead?

Which teams or players make the best use of set pieces?

In which leagues is the referee more likely to give a card?

How do players compare when they shoot with their weak foot versus strong foot? Or which players are ambidextrous?

Identify different styles of play (shooting from long range vs shooting from the box, crossing the ball vs passing the ball, use of headers).

Which teams have a bias for attacking on a particular flank?

And many many more…" https://www.kaggle.com/secareanualin/football-events/home

event_type
0 Announcement
1 Attempt
2 Corner
3 Foul
4 Yellow card
5 Second yellow card
6 Red card
7 Substitution
8 Free kick won
9 Offside
10 Hand ball
11 Penalty conceded

event_type2
12 Key Pass
13 Failed through ball
14 Sending off
15 Own goal

side
1 Home
2 Away

shot_place
1 Bit too high
2 Blocked
3 Bottom left corner
4 Bottom right corner
5 Centre of the goal
6 High and wide
7 Hits the bar
8 Misses to the left
9 Misses to the right
10 Too high
11 Top centre of the goal
12 Top left corner
13 Top right corner

shot_outcome
1 On target
2 Off target
3 Blocked
4 Hit the bar

location
1 Attacking half
2 Defensive half
3 Centre of the box
4 Left wing
5 Right wing
6 Difficult angle and long range
7 Difficult angle on the left
8 Difficult angle on the right
9 Left side of the box
10 Left side of the six yard box
11 Right side of the box
12 Right side of the six yard box
13 Very close range
14 Penalty spot
15 Outside the box
16 Long range
17 More than 35 yards
18 More than 40 yards
19 Not recorded

bodypart
1 right foot
2 left foot
3 head

assist_method
0 None
1 Pass
2 Cross
3 Headed pass
4 Through ball

situation
1 Open play
2 Set piece
3 Corner
4 Free kick

i. Libraries

library('tidyverse')
library('ggplot2')

ii. Load data and drop text column

events <- read_csv('events.csv') %>%
  select(-text)

iii. Explore the dataset

glimpse(events)
## Observations: 941,009
## Variables: 21
## $ id_odsp       <chr> "UFot0hit/", "UFot0hit/", "UFot0hit/", "UFot0hit...
## $ id_event      <chr> "UFot0hit1", "UFot0hit2", "UFot0hit3", "UFot0hit...
## $ sort_order    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ time          <int> 2, 4, 4, 7, 7, 9, 10, 11, 11, 13, 14, 14, 14, 17...
## $ event_type    <int> 1, 2, 2, 3, 8, 10, 2, 8, 3, 3, 8, 1, 3, 1, 1, 3,...
## $ event_type2   <int> 12, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 12, ...
## $ side          <int> 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, ...
## $ event_team    <chr> "Hamburg SV", "Borussia Dortmund", "Borussia Dor...
## $ opponent      <chr> "Borussia Dortmund", "Hamburg SV", "Hamburg SV",...
## $ player        <chr> "mladen petric", "dennis diekmeier", "heiko west...
## $ player2       <chr> "gokhan tore", "dennis diekmeier", "heiko wester...
## $ player_in     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ player_out    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ shot_place    <int> 6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 13, N...
## $ shot_outcome  <int> 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, NA...
## $ is_goal       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
## $ location      <int> 9, NA, NA, NA, 2, NA, NA, 2, NA, NA, 4, 15, NA, ...
## $ bodypart      <int> 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA...
## $ assist_method <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, ...
## $ situation     <int> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA...
## $ fast_break    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
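
The categorical columns in the output above are stored as integer codes, so plots and tables are easier to read once the codes are mapped to the labels from dictionary.txt. The following is a minimal sketch of one way to do that in base R; `decode_codes()` is a hypothetical helper written for this illustration, and the two short vectors are toy stand-ins for `events$situation` and `events$assist_method`, not real rows from the dataset.

```r
# Labels copied from the dictionary above, in code order.
situation_labels <- c("Open play", "Set piece", "Corner", "Free kick")
assist_labels <- c("None", "Pass", "Cross", "Headed pass", "Through ball")

# Hypothetical helper (not part of the dataset or any package): turns integer
# codes into a labelled factor. `first_code` is the code of the first label.
decode_codes <- function(codes, first_code, labels) {
  factor(codes,
         levels = seq(first_code, length.out = length(labels)),
         labels = labels)
}

# Toy vectors standing in for events$situation and events$assist_method.
situation <- c(1L, 3L, NA, 4L)
assist_method <- c(1L, 0L, 2L, 4L)

decoded_situation <- decode_codes(situation, first_code = 1, labels = situation_labels)
decoded_assist <- decode_codes(assist_method, first_code = 0, labels = assist_labels)

print(decoded_situation)
print(decoded_assist)
```

Missing codes stay `NA` after decoding, which keeps rows without a shot (where `situation` is not recorded) distinguishable from real categories.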

iv. The perfect game…home or away?

This question sets out to answer whether a “perfect game” happens more often in home or away matches.

# Create side_ as a labelled factor (Home/Away) from the integer side code.
# Keep only on-target shots (shot_outcome == 1) and drop foul-play events
# (fouls, cards, hand balls, penalties conceded) and sendings-off.
# Note: filter() also drops rows where event_type2 is NA; to keep them, use
# is.na(event_type2) | event_type2 != 14 instead.
# Plot situation on the x-axis, faceted by assist_method, with bars filled by
# side_ to compare home and away play.
events %>%
  mutate(side_ = factor(ifelse(side == 1, "Home", "Away"))) %>%
  filter(
    !is.na(shot_outcome),
    shot_outcome == 1,
    !event_type %in% c(3, 4, 5, 6, 10, 11),  # %in% avoids vector recycling
    event_type2 != 14
  ) %>%
  ggplot(aes(situation, fill = side_)) +
    geom_bar(color = "black", position = "dodge") +
    facet_grid(~assist_method) +
    ggtitle("Team Play in a Perfect Game") +
    theme(plot.title = element_text(hjust = 0.5))  # centre the title

In the eyes of most supporters, a “perfect game” is one in which at least one goal is scored, there is no foul play (fouls, players receiving cards or being sent off) and the players play as a team. With fouls and goals controlled for, the remaining variables are team play, measured by how other players assist, and whether this happens more often in home or away games. The question matters to loyal fans, who invest a lot of money and emotional energy and want to see good-quality football; the answer could help them choose which fixtures to attend. Prior research does not appear to have answered this question, and it is answerable from the data provided.
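
The definition above is stated per game, while the rows of events.csv are individual events. As a rough sketch only, one way to apply the definition at the game level is to group by the game identifier `id_odsp` and check both conditions per game. `flag_perfect_games()` is a hypothetical helper, the foul-play codes are an assumption taken from the dictionary (3, 4, 5, 6, 10, 11 and event_type2 14), and `toy_events` is made-up data used only to show the shape of the result.

```r
library(dplyr)

# Hypothetical helper: one row per game with a logical `perfect` flag.
flag_perfect_games <- function(events) {
  # Assumed foul-play codes: foul, yellow, second yellow, red card,
  # hand ball, penalty conceded.
  foul_types <- c(3, 4, 5, 6, 10, 11)
  events %>%
    group_by(id_odsp) %>%
    summarise(
      goals = sum(is_goal == 1, na.rm = TRUE),
      # %in% treats NA as "no match", so NA event_type2 rows are not counted.
      foul_events = sum(event_type %in% foul_types | event_type2 %in% 14),
      .groups = "drop"
    ) %>%
    mutate(perfect = goals >= 1 & foul_events == 0)
}

# Made-up events for three games, using the same column names as events.csv.
toy_events <- tibble(
  id_odsp     = c("game_a", "game_a", "game_b", "game_b", "game_c"),
  event_type  = c(1L, 2L, 1L, 3L, 2L),
  event_type2 = c(12L, NA, NA, NA, NA),
  is_goal     = c(1L, 0L, 1L, 0L, 0L)
)

result <- flag_perfect_games(toy_events)
print(result)
```

In the toy data only `game_a` has a goal and no foul-play event, so it is the only game flagged as perfect.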

Key information

A “perfect game” always happens at home.

Most goals come from a pass in open play in a “perfect game”. (situation 1 = Open play, assist_method 1 = Pass)

For corners and set pieces, there is much less difference between home and away games and between assist methods. (situation 2 = Set piece, situation 3 = Corner)

Almost no goals were scored in a “perfect game” with no assist method. (situation 1 = Open play, assist_method 0 = None)
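
The points above are read off the bar chart. To check them numerically, the home and away counts can be turned into shares within each situation and assist method. The following is a small sketch under stated assumptions: `home_share()` is a hypothetical helper, and `toy_shots` is made-up data with the same column names and codes as events.csv (side 1 = Home, 2 = Away), not results from the dataset.

```r
library(dplyr)

# Hypothetical helper: share of home vs away rows within each
# situation x assist_method combination.
home_share <- function(shots) {
  shots %>%
    count(situation, assist_method, side) %>%
    group_by(situation, assist_method) %>%
    mutate(share = n / sum(n)) %>%
    ungroup()
}

# Made-up shots: four open-play passes (3 home, 1 away) and
# two corner crosses (1 home, 1 away).
toy_shots <- tibble(
  situation     = c(1L, 1L, 1L, 1L, 3L, 3L),
  assist_method = c(1L, 1L, 1L, 1L, 2L, 2L),
  side          = c(1L, 1L, 1L, 2L, 1L, 2L)
)

shares <- home_share(toy_shots)
print(shares)
```

Applied to the filtered events, a share close to 0.5 for a given combination would indicate little difference between home and away.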