Preface

I am a women’s soccer fan eagerly awaiting the 2023 FIFA Women’s World Cup. Luckily, I discovered StatsBomb and their package for R, and I was amazed at the plots and game insights people were making with it.

StatsBomb graciously offers free access to data sets from historical leagues, tournaments, and events. I found the 2019 FIFA Women’s World Cup among them and decided to zero in on that tournament.

The amount of data in the database is truly amazing. I would love to explore it more, but for this project I knew I would need to narrow it down substantially. I decided to focus specifically on goals and shots and to compare the top 8 teams of the tournament with the average for all teams.

1. Research Questions

1.1. Are the Top-Ranked Teams shooting more?

1.2. Do the most successful international teams have a higher Shot Conversion Rate?

DEFINITION: The Shot Conversion Rate is the share of shots that turn into goals over a given period (a match or a tournament), expressed as a percentage: goals divided by shots, multiplied by 100.
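As a minimal sketch of this definition, the rate can be wrapped in a small function. The helper name `shot_conversion_rate` and its `digits` argument are my own illustration and are not part of StatsBombR; the example call uses the tournament-wide totals that appear later in this write-up.

```r
# Hypothetical helper (not part of StatsBombR): shot conversion rate as a percentage.
shot_conversion_rate <- function(goals, shots, digits = 0) {
  # Return NA when no shots were taken, so we never report a rate from 0 shots.
  ifelse(shots > 0, round(goals / shots * 100, digits), NA_real_)
}

# Tournament-wide totals used later in this analysis: 146 goals from 1314 shots.
shot_conversion_rate(146, 1314)
```

Because the function is vectorised, it can also be applied to whole columns of goals and shots at once.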

2. Setting Up My Environment

Accessing and Working with StatsBomb Data in R is a great guide that I used. It shows how to install the StatsBombR package and how to work with it. StatsBomb also publishes another guide on GitHub.

The SDMTools package needs Rtools in order to install and load correctly, so make sure Rtools is installed before you install SDMTools. Instructions for loading can be found here, or here is a link to where I troubleshot loading the package on the Posit Community page.

#import libraries
library(tidyverse)
library(ggplot2)
library(dplyr)
library(devtools)
library(SDMTools)
library(StatsBombR)
library(knitr)
library(kableExtra)

3. Collecting Data/Cleaning

I imported the free competition data from StatsBombR and filtered it down to the 2019 FIFA Women’s World Cup. This included importing all the event data for each individual match of the tournament.

#I imported the Free Competitions data frame and created an object named "FreeComps" from StatsBomb's FreeCompetitions() function.
FreeComps <- FreeCompetitions()
#I filtered down to the 2019 Women's World Cup, season_id 30.
Womens2019WorldCup <- FreeComps %>%
  filter(season_id == 30)
#I extracted the matches for the Women's 2019 World Cup.
Matches<-FreeMatches(Womens2019WorldCup)
## [1] "Whilst we are keen to share data and facilitate research, we also urge you to be responsible with the data. Please credit StatsBomb as your data source when using the data and visit https://statsbomb.com/media-pack/ to obtain our logos for public use."
#I created a dataframe, "StatsBombData", of the free event data available from StatsBomb for the 2019 Women's World Cup.
StatsBombData <- free_allevents(MatchesDF = Matches, Parallel = TRUE)
## [1] "Whilst we are keen to share data and facilitate research, we also urge you to be responsible with the data. Please credit StatsBomb as your data source when using the data and visit https://statsbomb.com/media-pack/ to obtain our logos for public use."
StatsBombData = allclean(StatsBombData)
#provides extra shot information.
StatsBombData <- shotinfo(StatsBombData)

3.1. 2019 FIFA Women’s World Cup Stats at Large

# I ran a formula to find the average number of shots taken by a team during a game in the tournament (52 matches x 2 teams per match).
round(length(which(StatsBombData$type.name=="Shot"))/(52*2),1)
## [1] 12.6
#I ran a formula to find the average number of goals scored by a team per game in the tournament.
Matches%>%
  summarise(sum(home_score, away_score))/(52*2)
##   sum(home_score, away_score)
## 1                    1.403846
# I took the combined total goals scored for the whole tournament (146) and divided it by the combined total shots for the whole tournament (1314). I multiplied this by 100 and rounded to the nearest whole number to find the "Average Conversion Rate" for all teams for the length of the tournament.
average_conversion_rate <- round(146/1314 * 100,0)
#I created an object that was a list of averages.
Average_for_all_teams <- list("-", "average_all_teams", 12.6, 1.4, "11%")

4. Examining the Data

Research Questions

4.1. Are the Top-Ranked Teams shooting more?

To figure out which teams were shooting more, I could either total each team’s shots over the whole tournament or find the average number of shots each team took per match. I chose the latter because teams played different numbers of matches, so a per-match average is comparable across teams and gives a better sense of what a team achieves from match to match.

4.2. Do the most successful international teams have a higher Shot Conversion Rate?

I chose to calculate the Average Shot Conversion Rate on a match-to-match basis as well, for the same reason.
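One detail is worth checking here: dividing average goals per match by average shots per match gives the same number as dividing total goals by total shots, but it is not the same as averaging each match’s own conversion rate. The sketch below uses made-up numbers (not StatsBomb data) to show the difference.

```r
# Illustrative numbers only (hypothetical, not taken from the StatsBomb data).
shots <- c(20, 10, 10)  # shots in three matches
goals <- c(2, 3, 1)     # goals in the same three matches

# Total goals over total shots: 6 / 40.
ratio_of_totals <- sum(goals) / sum(shots) * 100

# Average goals per match over average shots per match: the match count cancels out.
ratio_of_averages <- mean(goals) / mean(shots) * 100

# Average of the three per-match rates (10%, 30%, 10%): weights every match equally.
mean_of_rates <- mean(goals / shots) * 100

c(ratio_of_totals, ratio_of_averages, mean_of_rates)
```

The first two values agree, while the third is higher because the low-shot, high-scoring match counts as much as the high-shot match.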

I used a table to compare each of the top 8 teams with the average for all 24 teams competing in the tournament.

# I calculated the total shots and total goals for each team and grouped them by team. I placed them in a data frame, "shots_goals".
shots_goals <- StatsBombData %>%
  group_by(team.name) %>%
  summarise(total_shots = sum(type.name == "Shot", na.rm = TRUE),
            total_goals = sum(shot.outcome.name == "Goal", na.rm = TRUE))
#From the "shots_goals" data frame, I calculated the conversion rate and filtered the data down to the top 8 teams in the tournament. I also removed an unnecessary word, "Women's", from each team name. I removed the columns for total shots and total goals, so the data frame was left with the team name and conversion rate. I called this new data frame "Conversion_rate". I also added a percentage symbol to the conversion rate.
Conversion_rate <- shots_goals %>%
  mutate(conversion_rate = round(total_goals / total_shots * 100, 0)) %>%
  mutate(team.name = str_remove_all(team.name, " Women's")) %>%
  filter(team.name %in% c("United States", "Netherlands", "Sweden", "England",
                          "France", "Germany", "Italy", "Norway")) %>%
  select(-total_shots, -total_goals)

Conversion_rate$conversion_rate <- paste0(Conversion_rate$conversion_rate, "%")
# I created an object, "place", to hold the final ranking of each team, listed in the same (alphabetical) order as the rows of "Conversion_rate". I joined the column with my data frame and arranged it in ascending order based on place.
place <- c(4, "5-8 (in no particular order)", "5-8 (in no particular order)", "5-8 (in no particular order) ", 2, "5-8 (in no particular order)", 3, 1)
Conversion_Rate_Table_with_Place <- cbind(place, Conversion_rate)%>%
  arrange(place)
# I created a data frame called "every_team_shots_goals". I removed the unnecessary word "Women's" from the team name, grouped StatsBombData by team.name, and calculated the average shots per game and average goals per game for each team. I then filtered the teams down to the top 8.
every_team_shots_goals <- StatsBombData %>%
  mutate(team.name = str_remove_all(team.name, " Women's")) %>%
  group_by(team.name) %>%
  summarise(aver_shots = round(sum(type.name == "Shot", na.rm = TRUE) / n_distinct(match_id), 1),
            aver_goals = round(sum(shot.outcome.name == "Goal", na.rm = TRUE) / n_distinct(match_id), 1)) %>%
  filter(team.name %in% c("United States", "Netherlands", "Sweden", "England",
                          "France", "Germany", "Italy", "Norway"))
  

#I joined the "every_team_shots_goals" data frame with the "Conversion_Rate_Table_with_Place" data frame. I selected the column for "place" to be the first column of the data frame, followed by the rest of the columns.
Complete_Table <- every_team_shots_goals %>%
  inner_join(Conversion_Rate_Table_with_Place, by = "team.name") %>%
  select(place, everything())
#I added a row to the "Complete_Table" data frame containing the averages for all teams. I created a new data frame, "Complete_Table1", that is "Complete_Table" arranged by place (the "-" of the averages row sorts first).
Complete_Table[nrow(Complete_Table) + 1, ] <- list("-", "Average for All Teams", 12.6, 1.4, "11%")

Complete_Table1 <- Complete_Table %>%
  arrange(place)
kbl(Complete_Table1, format = "html", caption = "<center><strong>Averages per Game of Top 8 Teams</strong></center>", col.names = c("Place", "Team", "Average Shots per Game", "Average Goals per Game", "Shot Conversion Rate")) %>%
  row_spec(1, bold = TRUE, color = "black", background = "yellow") %>%
  row_spec(2:5, bold = TRUE, color = "black", background = "deepskyblue") %>%
  kable_styling(c("striped", "bordered")) %>%
  collapse_rows(columns = 1,
                valign = "middle")
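Averages per Game of Top 8 Teams

| Place | Team | Average Shots per Game | Average Goals per Game | Shot Conversion Rate |
|---|---|---|---|---|
| - | Average for All Teams | 12.6 | 1.4 | 11% |
| 1 | United States | 18.6 | 3.6 | 19% |
| 2 | Netherlands | 12.3 | 1.6 | 13% |
| 3 | Sweden | 15.0 | 1.7 | 11% |
| 4 | England | 12.7 | 1.9 | 15% |
| 5-8 (in no particular order) | France | 18.2 | 2.0 | 11% |
| 5-8 (in no particular order) | Germany | 15.4 | 2.0 | 13% |
| 5-8 (in no particular order) | Norway | 13.2 | 1.8 | 14% |
| 5-8 (in no particular order) | Italy | 10.6 | 1.8 | 17% |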
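The `place` vector in the code above is matched to the teams purely by position, which only works while the rows stay in alphabetical order. A more defensive option, sketched here in base R, is to key the placings by team name and merge. The data frames `placings` and `rates` are stand-ins I made up for illustration; their values are copied from the `place` vector and the table above.

```r
# Final placings keyed by team name (values copied from the `place` vector).
placings <- data.frame(
  team.name = c("United States", "Netherlands", "Sweden", "England",
                "France", "Germany", "Italy", "Norway"),
  place = c("1", "2", "3", "4", rep("5-8 (in no particular order)", 4)),
  stringsAsFactors = FALSE
)

# Stand-in for Conversion_rate, using the rates reported in the table.
rates <- data.frame(
  team.name = c("England", "France", "Germany", "Italy",
                "Netherlands", "Norway", "Sweden", "United States"),
  conversion_rate = c("15%", "11%", "13%", "17%", "13%", "14%", "11%", "19%"),
  stringsAsFactors = FALSE
)

# merge() matches rows on team.name, so the row order of either input no longer matters.
with_place <- merge(placings, rates, by = "team.name")
with_place <- with_place[order(with_place$place), ]
with_place
```

With this approach, reordering or re-filtering the teams cannot silently attach a placing to the wrong team.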

4.1. Table Results. Interesting.

The United States’ conversion rate of 19% was the highest among the top 8 teams and well above the 11% average for all teams. They also averaged more shots and goals per match than any other team in the table.

I wanted to reexamine the numbers. I decided to look specifically at the shots taken and goals scored by each team in each game.

4.2. Looking at shots taken by each team per game.


#I extracted type.name = "shot", match_id,and possession_team.name from the StatsBomb dataframe. Because the StatsBomb dataframe does not have a match date column, I used match_id as a common key to link the Statsbomb data frame to the Matches data frame.  I extracted match_id and match date from the Matches data frame. I also removed the extra word, "Women's" from the possession_team.name column. I manually colored only the line for top 4 teams and left the other teams grey so that the chart would be easier to read.

#I counted the shots for each team in each match and kept only the rows for the top 8 teams, filtering by team name rather than by row position.
shots_per_game <- StatsBombData %>%
  mutate(team = str_remove_all(possession_team.name, " Women's")) %>%
  group_by(team, match_id) %>%
  summarise(Shots = sum(type.name == "Shot", na.rm = TRUE), .groups = "drop") %>%
  filter(team %in% c("United States", "Netherlands", "Sweden", "England",
                     "France", "Germany", "Italy", "Norway"))

#Using the match_id, I found the date of each game from the "Matches" dataframe.

shots_match_date <- Matches %>%
  select(match_id, match_date)
  
#I joined the above 2 dataframes with the common key, match_id.
shots_per_game_stats <- shots_match_date %>% inner_join(shots_per_game, by = c('match_id'))


#I plotted the match date on the x axis and the number of shots on the y. The colors are assigned to the teams in alphabetical order.

ggplot(data = shots_per_game_stats, aes(x = match_date, y = Shots, group = team, color = team)) +
  geom_point(size = 3) +
  geom_line() +
  xlab("Game Date") +
  ylab("Number of Shots") +
  labs(title = "Number of Shots per Match Date",
       caption = "Source: StatsBomb") +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  scale_color_manual(values = c("green", "grey", "grey", "grey",
                                "red", "grey", "yellow", "blue4"))

The number of shots each team took varied widely from match to match.

The matches the United States played at the beginning of the tournament offered them more opportunities to shoot. As the tournament progressed, the United States had fewer shots per match, which is consistent with facing tougher teams as they advanced.

4.3. Let’s look at the number of goals scored by each team per game.

#I extracted the team name, match date, and score for the home side of every match from the StatsBomb data frame "Matches", and removed the unnecessary word "Women's" from the team name. I manually colored only the lines for the top 4 teams and left the other teams grey so that the chart would be easier to read.
top_8_teams <- c("United States", "Netherlands", "Sweden", "England",
                 "France", "Germany", "Italy", "Norway")

home_goals_per_game <- Matches %>%
  transmute(game_date = match_date,
            team = str_remove_all(home_team.home_team_name, " Women's"),
            goals = home_score)

# I did the same with away teams, away goals, and game dates. This allowed me to account for all the games each team played in.
away_goals_per_game <- Matches %>%
  transmute(game_date = match_date,
            team = str_remove_all(away_team.away_team_name, " Women's"),
            goals = away_score)

#I combined the two data frames and kept only the rows for the top 8 teams, filtering by team name rather than by row position.
team_goals_for_every_game <- rbind(home_goals_per_game, away_goals_per_game) %>%
  filter(team %in% top_8_teams)



#I plotted the match date on the x axis and the number of goals on the y. The colors are assigned to the teams in alphabetical order.

ggplot(data = team_goals_for_every_game, aes(x = game_date, y = goals, group = team, color = team)) +
  geom_point(size = 3) +
  geom_line() +
  xlab("Game Date") +
  ylab("Number of Goals") +
  labs(title = "Number of Goals per Match Date",
       caption = "Source: StatsBomb") +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  scale_color_manual(values = c("green", "grey", "grey", "grey",
                                "red", "grey", "yellow", "blue4"))

4.4. Possible Outlier? Let’s reexamine.

Wow! On June 11, 2019, the United States played a match where they scored 13 goals! This may explain their high shot conversion rate for the tournament.

Let’s break down the United States’ game statistics for the tournament and recalculate their Shot Conversion Rate while omitting the shots and goals from the first game on June 11th, 2019. What will happen to the United States’ Shot Conversion Rate? Will it fall within the average range without this game?

#I created a dataframe, us_shots_per_game, containing shots and match_id. I counted the shots the United States Women's Team took in each game, grouped by match_id. Match_id is a common key that links the StatsBomb dataframe, which holds the data on shots, to the Matches dataframe, which holds the data on goals.


us_shots_per_game = StatsBombData %>%
  group_by(match_id) %>%
  filter(team.name == "United States Women's")%>%
  summarise(shot_per_game = sum(type.name == "Shot"))


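#I created a dataframe, us_goals_per_game. I filtered for the matches involving the United States Women's Team and grouped by match_id and match_date. Note that sum(home_score, away_score) adds both teams' scores, so goals_scored is the total number of goals in each match, including any scored against the United States.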

us_goals_per_game = Matches %>%
  group_by(match_id, match_date) %>%
  filter(home_team.home_team_name == "United States Women's"| away_team.away_team_name == "United States Women's")%>%
  summarise(goals_scored = sum(home_score, away_score))
## `summarise()` has grouped output by 'match_id'. You can override using the
## `.groups` argument.
#I joined the above 2 dataframes with the common key, match_id.
us_game_stats <- us_shots_per_game %>%
  inner_join(us_goals_per_game, by = "match_id")

us_game_stats <- us_game_stats %>%
  select(match_id, match_date, shot_per_game, goals_scored)

kbl(us_game_stats, format = "html", caption = "<center><strong>United States Stats Per Game </strong></center>", col.names = c("Match ID", "Date", "Shots", "Goals")) %>%
  kable_styling(c("striped", "bordered")) %>%
  add_footnote(c("Source: Statsbomb"))
United States Stats Per Game

| Match ID | Date | Shots | Goals |
|---|---|---|---|
| 22943 | 2019-06-11 | 39 | 13 |
| 22974 | 2019-06-16 | 28 | 3 |
| 68345 | 2019-06-20 | 16 | 2 |
| 69161 | 2019-06-24 | 11 | 3 |
| 69202 | 2019-06-28 | 10 | 3 |
| 69258 | 2019-07-02 | 11 | 3 |
| 69321 | 2019-07-07 | 15 | 2 |

a Source: Statsbomb

Now, I want to omit the June 11th, 2019 game and recalculate the United States’ Shot Conversion Rate.

#In the us_game_stats dataframe, I subtracted the 13 goals from the goals_scored total and the 39 shots from the shot_per_game total. This omitted the data from the first game. I then recalculated the conversion rate from the two remaining totals.
us_game_stats %>%
  summarise(us_goals_minus_1 = sum(goals_scored) - 13,
            us_shots_minus_1 = sum(shot_per_game) - 39) %>%
  mutate(usa_new_conversion_rate = round(us_goals_minus_1 / us_shots_minus_1 * 100, 1))
## # A tibble: 1 × 3
##   us_goals_minus_1 us_shots_minus_1 usa_new_conversion_rate
##              <dbl>            <dbl>                   <dbl>
## 1               16               91                    17.6

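Even after excluding the June 11th match, the United States’ Shot Conversion Rate is 17.6%, still well above the 11% average. Keep in mind that goals_scored adds both teams’ scores in each match, so 17.6% is an upper bound on the United States’ own rate.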
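The same recalculation can be done by dropping the match by date instead of subtracting its totals by hand, which avoids hard-coding 13 and 39. This is a minimal base R sketch; the data frame `us_game_stats_demo` is a stand-in I typed in from the "United States Stats Per Game" table above, so its goals column carries the same match totals as that table.

```r
# Per-match figures copied from the "United States Stats Per Game" table.
us_game_stats_demo <- data.frame(
  match_date    = c("2019-06-11", "2019-06-16", "2019-06-20", "2019-06-24",
                    "2019-06-28", "2019-07-02", "2019-07-07"),
  shot_per_game = c(39, 28, 16, 11, 10, 11, 15),
  goals_scored  = c(13, 3, 2, 3, 3, 3, 2),
  stringsAsFactors = FALSE
)

# Drop the opening match by its date rather than subtracting its totals.
without_opener <- subset(us_game_stats_demo, match_date != "2019-06-11")

# Conversion rate over the remaining six matches, as a percentage.
rate_without_opener <- round(
  sum(without_opener$goals_scored) / sum(without_opener$shot_per_game) * 100, 1
)
rate_without_opener
```

Filtering by date also makes it easy to repeat the check for any other match.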

5. Findings

5.1. Research Questions

5.1.1. Are the Top-Ranked Teams shooting more?

#I made a bar graph in ascending order of average shots per match for each team. I manually changed the color of the bars so that the average for all teams and the United States would stand out.
ggplot(data = Complete_Table, aes(x=reorder(team.name, aver_shots), y = aver_shots, fill = team.name))+
  geom_bar(stat = 'identity', show.legend = FALSE)+
   scale_fill_manual(values=c("Italy" = "#0099FF",
                              "Average for All Teams" = "red",
                              "Netherlands" = "#0099FF",
                              "England" = "#0099FF",
                              "Norway" = "#0099FF",
                              "Sweden" = "#0099FF",
                              "Germany" = "#0099FF",
                              "France" = "#0099FF",
                              "United States" = "darkblue"))+
  scale_x_discrete(labels=c("Italy", "2nd Place Netherlands", "Combined Avg. for All Teams",  "4th Place England", "Norway", "3rd Place Sweden", "Germany", "France", "1st Place United States"), guide = guide_axis(angle = 45))+
  labs(title = "Team Average Shots Per Match", subtitle = "2019 FIFA Women's World Cup",
       caption = "Source: StatsBomb")+
  ylab("Avg Shots per Match")+
  theme(
    axis.title.x = element_blank())

Most of the top 8 teams in the 2019 FIFA Women’s World Cup shot more per match than the average for all 24 teams in the tournament: six of the eight were above the 12.6 average.

Notably, the United States, the champions, had the highest average with 18.6 shots per match.
It is also worth noting that the Netherlands, in 2nd place, averaged slightly fewer shots per match (12.3) than the overall average.

5.1.2. Do the most successful international teams have a higher Shot Conversion Rate?

#I created a bar graph showing the shot conversion rate for each team, in ascending order. I included the United States's shot conversion rate that omitted June 11th's match to visualize the difference this made. I also manually changed the colors of the bars to make the graph easier to read.
new_row <- tibble(place = "1", team.name = "United States Omitting June 11th Game", aver_shots = 15, aver_goals = 2.6, conversion_rate = "17.6%")
Complete_Table_new <- bind_rows(Complete_Table, new_row)
#I removed the percentage symbol and converted the conversion rate back to a number for plotting.
Complete_Table_new$conversion_rate <- as.numeric(gsub("%", "", Complete_Table_new$conversion_rate))


ggplot(data = Complete_Table_new, aes(x=reorder(team.name, conversion_rate), y = conversion_rate, fill = team.name))+
  geom_bar(stat = 'identity', show.legend = FALSE)+
  scale_fill_manual(values=c("Average for All Teams" = "red",
                             "Italy" = "#0099FF",
                             "England" = "#0099FF",
                             "France" = "#0099FF",
                             "Germany" = "#0099FF",
                             "Netherlands" = "#0099FF",
                             "Norway" = "#0099FF",
                             "Sweden" = "#0099FF",
                             "United States" = "darkblue",
                             "United States Omitting June 11th Game" = "darkblue"))+ 
  scale_x_discrete(labels=c("Combined Avg. All Teams", "France", "3rd Place Sweden", "Germany", "2nd Place Netherlands", "Norway", "4th Place England", "Italy", "United States Omitting June 11th Match", "1st Place United States"), guide = guide_axis(angle = 45))+
  labs(title = "Average Shot Conversion Rate per Match", subtitle = "2019 FIFA Women's World Cup",
       caption = "Source: StatsBomb")+
  ylab("Shot Conversion Rate (%)")+
  theme(
    axis.title.x = element_blank())                                 

Every top 8 team converted shots into goals at or above the 11% average for all teams, and six of the eight exceeded it.

Even after omitting the United States’ June 11th, 2019 match, their shot conversion rate was about 1.6 times the combined average for all teams (17.6% versus 11%).

6. Conclusion - Insights, Limitations, and More to Explore

6.1. Insights

6.1.1. Among the top 8 teams, the champions of the 2019 FIFA Women’s World Cup had the most shots per match on average and the highest shot conversion rate.

6.1.2. Most of the top teams shot more per match than average. However, the Netherlands was able to earn 2nd place with a shot average per match that was just under the overall average for all the teams combined.

6.1.3. All of the top 8 teams were converting shots to goals at or above the overall average. The United States, the champions, led with a shot conversion rate about 1.7 times the overall average (19% versus 11%).

6.2. Limitations

This is a glimpse of one tournament. Other matches and tournaments will show different results.

Shots per match and the Shot Conversion Rate are two narrow metrics for measuring the multi-dimensional game of soccer. Numerous variables need to be synchronized throughout a match and a tournament for a team to be successful.

Each of these metrics could also be examined closer. For example, measuring shots-on-target may show different results than all shots taken.

6.3. More to Explore

6.3.1. Shots

An analysis of the quality of shots teams are taking would be interesting. This could include comparing teams’ number of shots on target and where those shots are coming from. The analysis could also look at the path to the goal: is there a clear line between the shooter and the goalkeeper, or are opposing players in the path of the shot?
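As a starting point for the shots-on-target idea, here is a rough base R sketch. It assumes that the shot.outcome.name values "Goal", "Saved", and "Saved to Post" are the ones that count as on target; that choice is my assumption and should be checked against the outcome labels in the actual data. The shot events and team names below are made up for illustration.

```r
# Assumption: these outcome labels count as "on target"; verify against the real data.
on_target_outcomes <- c("Goal", "Saved", "Saved to Post")

# Made-up shot events for two hypothetical teams.
shots_demo <- data.frame(
  team.name = c("Team A", "Team A", "Team A", "Team A",
                "Team B", "Team B", "Team B"),
  shot.outcome.name = c("Goal", "Saved", "Off T", "Blocked",
                        "Saved", "Wayward", "Off T"),
  stringsAsFactors = FALSE
)

# Flag each shot as on target or not.
shots_demo$on_target <- shots_demo$shot.outcome.name %in% on_target_outcomes

# Share of each team's shots that were on target, as a percentage.
on_target_pct <- tapply(shots_demo$on_target, shots_demo$team.name, mean) * 100
round(on_target_pct, 1)
```

The same flag could feed a conversion rate based on shots on target rather than all shots.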

6.3.2. Shot Conversion Rate

These metrics could be compared not only between teams in one tournament but also over time: a longitudinal study of successful women’s soccer teams would be interesting. How have the metrics changed? What is working?