jc3181 AT columbia DOT edu

 

This is a quick demonstration of the data contained in the spainliga dataset in my engsoccerdata package. I realize now that calling it engsoccerdata was a mistake as I am now including European soccer data too. I think it’s better to add new datasets to this existing package rather than creating a new package.

As of writing, the spainliga dataset contains all results (including half-time results) from La Liga, the Spanish top flight, from its inception in the 1928/29 season (although all games of that season took place in 1929) through to the end of the 2013/14 season. It does not include the relegation playoffs, which mostly involved Segunda División teams, in the seasons when they occurred. It’s important to note that I use the most recent name for every team, i.e. ‘FC Barcelona’ rather than ‘CF Barcelona’, ‘Real Sociedad’ rather than ‘Donostia’.

 


Getting started

First install my engsoccerdata package from GitHub if you haven’t already. Make sure you have the devtools package loaded:

library(devtools)
install_github("jalapic/engsoccerdata")

 

Now load the required packages. In addition to engsoccerdata, we are also using dplyr to flexibly restructure our data and ggplot2 for visualizing it.

library(engsoccerdata)
library(dplyr)
library(ggplot2)

 

The data

The dataset we are using is spainliga. We can look at it here:

df <- spainliga
head(df)
##         date Season               home         visitor  HT  FT hgoal vgoal
## 1 1929-02-10   1928    Arenas de Getxo Atlético Madrid 0-2 2-3     2     3
## 2 1929-02-10   1928 Espanyol Barcelona      Real Unión 1-0 3-2     3     2
## 3 1929-02-10   1928        Real Madrid       CE Europa 0-0 5-0     5     0
## 4 1929-02-10   1928      Real Sociedad Athletic Bilbao 1-1 1-1     1     1
## 5 1929-02-12   1928   Racing Santander    FC Barcelona 0-0 0-2     0     2
## 6 1929-02-17   1928       FC Barcelona     Real Madrid 0-1 1-2     1     2
##   tier  round group notes
## 1    1 league  <NA>  <NA>
## 2    1 league  <NA>  <NA>
## 3    1 league  <NA>  <NA>
## 4    1 league  <NA>  <NA>
## 5    1 league  <NA>  <NA>
## 6    1 league  <NA>  <NA>

Most variables are self-explanatory. The ‘round’ variable refers to whether the game was a typical league game or part of the second phase of matches that took place in 1986/87 when the league was split into three separate mini-leagues after 18 matches had been completed. The ‘group’ variable refers to these different mini-leagues. The ‘notes’ variable contains one piece of information regarding a match in the 1979/80 season - “CD Málaga 0-1 AD Almería” - that was not played but awarded 0-1 to AD Almería as CD Málaga failed to participate.

table(df$round)
## 
## league phase2 
##  23065     90
table(df$group)
## 
##  A  B  C 
## 30 30 30

 

Getting summary data

To look at all-time performances in La Liga, we can summarize the data using dplyr. First we need a separate observation for every team’s game. This means doubling the dataframe, so that each match has two observations (one for each team involved). We can then calculate the goals for (GF) and against (GA) for each team from the appropriate variable of the original match observation: for GF, ‘hgoal’ if the team was the home team and ‘vgoal’ if it was the away team, and the reverse for GA. Goal difference (GD) is then simply GF minus GA.

temp <-
  rbind(
    df %>% select(Season, team=home, opp=visitor, GF=hgoal, GA=vgoal),
    df %>% select(Season, team=visitor, opp=home, GF=vgoal, GA=hgoal)
  ) #rbind two copies of the original df, simply reversing home/away team for each match

temp$GF <- as.numeric(temp$GF) #make sure is numeric
temp$GA <- as.numeric(temp$GA) #make sure is numeric
temp <- temp %>% mutate(GD = GF-GA)
head(temp)
##   Season               team             opp GF GA GD
## 1   1928    Arenas de Getxo Atlético Madrid  2  3 -1
## 2   1928 Espanyol Barcelona      Real Unión  3  2  1
## 3   1928        Real Madrid       CE Europa  5  0  5
## 4   1928      Real Sociedad Athletic Bilbao  1  1  0
## 5   1928   Racing Santander    FC Barcelona  0  2 -2
## 6   1928       FC Barcelona     Real Madrid  1  2 -1
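
A quick way to convince yourself that the doubling step behaves as described is to run it on a tiny made-up set of matches and check a few invariants. This is only a sketch in base R: the three matches below are invented for illustration, and the column names simply mirror those in spainliga.

```r
# Toy stand-in for the match data (one row per match, numeric goal counts).
matches <- data.frame(
  home    = c("A", "B", "C"),
  visitor = c("B", "C", "A"),
  hgoal   = c(2, 1, 0),
  vgoal   = c(1, 1, 3),
  stringsAsFactors = FALSE
)

# One row per team per match: home perspective stacked on visitor perspective.
long <- rbind(
  data.frame(team = matches$home, opp = matches$visitor,
             GF = matches$hgoal, GA = matches$vgoal, stringsAsFactors = FALSE),
  data.frame(team = matches$visitor, opp = matches$home,
             GF = matches$vgoal, GA = matches$hgoal, stringsAsFactors = FALSE)
)
long$GD <- long$GF - long$GA

# Invariants of the doubled table: twice as many rows, goals for and
# against balance out, and goal difference sums to zero.
stopifnot(nrow(long) == 2 * nrow(matches))
stopifnot(sum(long$GF) == sum(long$GA))
stopifnot(sum(long$GD) == 0)
```

The same three checks can be run on the real temp dataframe if anything in the later summaries looks off.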

Next, we can get a summary for each team of their total games played, total wins, draws and losses in La Liga, as well as their cumulative goals scored, goals conceded and goal difference.

temp1 <-
  temp %>% group_by(team) %>%
  summarize(GP = n(),
            goalsF = sum(GF),
            goalsA = sum(GA),
            goaldif = sum(GD),
            W = sum(GD>0),
            D = sum(GD==0),
            L = sum(GD<0)
  )

temp1
## Source: local data frame [59 x 8]
## 
##               team   GP goalsF goalsA goaldif    W   D   L
## 1       AD Almería   68     72    116     -44   17  18  33
## 2         Albacete  270    320    411     -91   76  76 118
## 3  Arenas de Getxo  130    227    308     -81   43  21  66
## 4  Athletic Bilbao 2648   4478   3571     907 1157 609 882
## 5  Atletico Tetuan   30     51     85     -34    7   5  18
## 6  Atlético Madrid 2500   4334   3235    1099 1167 576 757
## 7        Burgos CF  204    216    310     -94   59  50  95
## 8       CA Osasuna 1278   1458   1739    -281  421 316 541
## 9        CD Alavés  342    417    585    -168  111  68 163
## 10     CD Alcoyano  108    145    252    -107   30  16  62
## ..             ...  ...    ...    ...     ...  ... ... ...
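
The summary above has no column for the number of seasons each team has spent in La Liga. One way to add it is dplyr’s n_distinct() inside the same kind of summarize() call. The sketch below uses a made-up data frame (team names and seasons are placeholders) rather than the real temp dataframe.

```r
library(dplyr)

# Made-up team-level rows: one row per team per game, as in `temp`.
toy <- data.frame(
  Season = c(1928, 1928, 1929, 1929, 1929),
  team   = c("A", "B", "A", "A", "C"),
  stringsAsFactors = FALSE
)

seasons <- toy %>%
  group_by(team) %>%
  summarize(Seasons = n_distinct(Season),  # number of different seasons played
            GP = n())                      # number of games played

seasons
```

Here team A appears in two different seasons across three games, so it gets Seasons = 2 and GP = 3.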

Next I’m just setting up a ggplot theme to make the graph look pretty…

### just setting my theme
mytheme <- theme(
    plot.title = element_text(hjust=0,vjust=1, size=rel(1.7)),
    panel.background = element_blank(),
    panel.grid.major.y = element_line(color="gray65"),
    panel.grid.major.x = element_line(color="gray65"),
    panel.grid.minor = element_blank(),
    plot.background  = element_blank(),
    text = element_text(color="gray20", size=10),
    axis.text = element_text(size=rel(1.0)),
    axis.text.x = element_text(color="gray20",size=rel(1.5)),
    axis.text.y = element_text(color="gray20", size=rel(1.5)),
    axis.title.x = element_text(size=rel(1.5), vjust=0),
    axis.title.y = element_text(size=rel(1.5), vjust=1),
    axis.ticks.y = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "none"
  )

 

 

Here is the plot…

 ggplot(temp1, aes(GP,goaldif)) + geom_point(size=4, color="firebrick1") + mytheme +
   xlab("Games played in La Liga")+
   ylab("Cumulative goal difference")+
   ggtitle("All time goal difference in La Liga")
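
Labels for selected points can also be added in code. A minimal sketch, assuming a rule such as "label teams with more than 1000 games" (the threshold is arbitrary and only for illustration): pass a subset of the data to geom_text(). The four rows below are copied from the summary table above.

```r
library(ggplot2)

# Four rows taken from the summary table, enough to show the idea.
toy <- data.frame(
  team    = c("Athletic Bilbao", "CA Osasuna", "Albacete", "Burgos CF"),
  GP      = c(2648, 1278, 270, 204),
  goaldif = c(907, -281, -91, -94),
  stringsAsFactors = FALSE
)

# Arbitrary rule for which points get a label.
to_label <- subset(toy, GP > 1000)

p <- ggplot(toy, aes(GP, goaldif)) +
  geom_point(size = 4, color = "firebrick1") +
  geom_text(data = to_label, aes(label = team), vjust = -1) +  # label only the subset
  xlab("Games played in La Liga") +
  ylab("Cumulative goal difference")
```

Overlapping labels may still need nudging by hand, which is why a manual pass can end up being quicker for a one-off figure.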

 

 

With a little manual editing to add labels to some interesting data points, it looks like this:

 

 

Season Results Matrix

I thought that it would be interesting to make a results matrix for a particular season. Most results matrices are ordered in alphabetical order, but I thought it’d be better to order them by final league standings. Visually, this should give a better indicator of team strength.

I managed to make this work, but with a bit more manual editing than I typically like.

First of all, let’s get the data for La Liga in 2013/14 and make a result column that depicts home wins, away wins and draws.

df <- df %>% filter(Season==2013)
df <- df %>% 
          select(home,visitor,FT,hgoal,vgoal) %>% 
          mutate(GD=hgoal-vgoal,
                 result = ifelse(GD>0, "H", ifelse(GD<0, "A", "D"))
          )

head(df)
##              home         visitor  FT hgoal vgoal GD result
## 1   Real Sociedad       Getafe CF 2-0     2     0  2      H
## 2 Real Valladolid Athletic Bilbao 1-2     1     2 -1      A
## 3     Valencia CF       Málaga CF 1-0     1     0  1      H
## 4    FC Barcelona      Levante UD 7-0     7     0  7      H
## 5     Real Madrid      Real Betis 2-1     2     1  1      H
## 6      CA Osasuna      Granada CF 1-2     1     2 -1      A

 

Next, we make the final table. I am sorting this table by total points. The tie-breaker for teams on equal points is actually head-to-head performance. For this quick worked example, I am just going to manually enter league positions for those teams that are tied on points. It wouldn’t be too hard to write a function to automatically work out the head-to-head deciders.

temp <-
  rbind(
    df %>% select(team=home, opp=visitor, GF=hgoal, GA=vgoal),
    df %>% select(team=visitor, opp=home, GF=vgoal, GA=hgoal)
  ) #rbind two copies of the original df, simply reversing home/away team for each match

temp1<-
  temp %>%
  mutate(GD = GF-GA) %>%
  group_by(team) %>%
  summarize(GP = n(),
            gf = sum(GF),
            ga = sum(GA),
            gd = sum(GD),
            W = sum(GD>0),
            D = sum(GD==0),
            L = sum(GD<0)
            ) %>%
  mutate(Pts = (W*3) + D) %>%
  arrange(desc(Pts))

temp1 <- temp1 %>% mutate(pos = rank(desc(Pts)))
temp1
## Source: local data frame [20 x 10]
## 
##                  team GP  gf ga  gd  W  D  L Pts  pos
## 1     Atlético Madrid 38  77 26  51 28  6  4  90  1.0
## 2        FC Barcelona 38 100 33  67 27  6  5  87  2.5
## 3         Real Madrid 38 104 38  66 27  6  5  87  2.5
## 4     Athletic Bilbao 38  66 39  27 20 10  8  70  4.0
## 5          Sevilla FC 38  69 52  17 18  9 11  63  5.0
## 6       Real Sociedad 38  62 55   7 16 11 11  59  6.5
## 7       Villarreal CF 38  60 44  16 17  8 13  59  6.5
## 8          Celta Vigo 38  49 54  -5 14  7 17  49  8.5
## 9         Valencia CF 38  51 53  -2 13 10 15  49  8.5
## 10         Levante UD 38  35 43  -8 12 12 14  48 10.0
## 11          Málaga CF 38  39 46  -7 12  9 17  45 11.0
## 12     Rayo Vallecano 38  46 80 -34 13  4 21  43 12.0
## 13 Espanyol Barcelona 38  41 51 -10 11  9 18  42 13.5
## 14          Getafe CF 38  35 54 -19 11  9 18  42 13.5
## 15         Granada CF 38  32 56 -24 12  5 21  41 15.0
## 16           Elche CF 38  30 50 -20  9 13 16  40 16.5
## 17         UD Almería 38  43 71 -28 11  7 20  40 16.5
## 18         CA Osasuna 38  32 62 -30 10  9 19  39 18.0
## 19    Real Valladolid 38  38 60 -22  7 15 16  36 19.0
## 20         Real Betis 38  36 78 -42  6  7 25  25 20.0
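
The fractional positions such as 2.5 come from rank(), which by default gives tied values the average of the ranks they span. A small base R illustration, using the points of the top four teams from the table above:

```r
pts <- c(90, 87, 87, 70)

# Default ties.method = "average": the two teams on 87 share ranks 2 and 3.
rank(-pts)                        # 1.0 2.5 2.5 4.0

# ties.method = "min" gives both tied teams the best available rank instead.
rank(-pts, ties.method = "min")   # 1 2 2 4
```

Either way the ties are still unresolved, so a further tie-breaking step is needed.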

 

temp2 <- temp1 %>% select(team,pos)

#manually edit tied teams - referred to a published final standings table
temp2[2,2] <- 2
temp2[3,2] <- 3
temp2[7,2] <- 6
temp2[6,2] <- 7
temp2[9,2] <- 8
temp2[8,2] <- 9
temp2[14,2] <- 13
temp2[13,2] <- 14
temp2[16,2] <- 16
temp2[17,2] <- 17

temp2 <- temp2 %>% arrange(pos)
temp2
## Source: local data frame [20 x 2]
## 
##                  team pos
## 1     Atlético Madrid   1
## 2        FC Barcelona   2
## 3         Real Madrid   3
## 4     Athletic Bilbao   4
## 5          Sevilla FC   5
## 6       Villarreal CF   6
## 7       Real Sociedad   7
## 8         Valencia CF   8
## 9          Celta Vigo   9
## 10         Levante UD  10
## 11          Málaga CF  11
## 12     Rayo Vallecano  12
## 13          Getafe CF  13
## 14 Espanyol Barcelona  14
## 15         Granada CF  15
## 16           Elche CF  16
## 17         UD Almería  17
## 18         CA Osasuna  18
## 19    Real Valladolid  19
## 20         Real Betis  20
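
The manual position edits could in principle be replaced by a function. The following is only a rough sketch in base R, not a full implementation of the official rules. It assumes the deciders are, in order, head-to-head points, head-to-head goal difference and then overall goal difference, and it ignores special cases such as seasons in which the tied teams have not yet played each other twice. The function names standings() and rank_with_h2h() and the toy matches are made up for illustration.

```r
# Points and goal difference per team for a set of matches `m`
# (columns home, visitor, hgoal, vgoal, as in the post).
standings <- function(m) {
  long <- data.frame(
    team = c(as.character(m$home), as.character(m$visitor)),
    GF   = c(m$hgoal, m$vgoal),
    GA   = c(m$vgoal, m$hgoal),
    stringsAsFactors = FALSE
  )
  long$GD  <- long$GF - long$GA
  long$Pts <- ifelse(long$GD > 0, 3, ifelse(long$GD == 0, 1, 0))
  aggregate(cbind(Pts, GD) ~ team, data = long, FUN = sum)
}

# Rank teams by points, breaking ties with a mini-league among the tied teams.
rank_with_h2h <- function(m) {
  tab <- standings(m)
  tab$h2hPts <- 0
  tab$h2hGD  <- 0
  for (p in unique(tab$Pts[duplicated(tab$Pts)])) {
    tied <- as.character(tab$team[tab$Pts == p])
    mini <- m[as.character(m$home) %in% tied &
              as.character(m$visitor) %in% tied, ]
    if (nrow(mini) > 0) {
      h2h <- standings(mini)
      idx <- match(as.character(h2h$team), as.character(tab$team))
      tab$h2hPts[idx] <- h2h$Pts
      tab$h2hGD[idx]  <- h2h$GD
    }
  }
  tab <- tab[order(-tab$Pts, -tab$h2hPts, -tab$h2hGD, -tab$GD), ]
  tab$pos <- seq_len(nrow(tab))
  rownames(tab) <- NULL
  tab
}

# Toy double round-robin: A and B finish level on points, B has the better
# overall goal difference, but A took more points from their two meetings.
matches <- data.frame(
  home    = c("A", "B", "A", "C", "B", "C"),
  visitor = c("B", "A", "C", "A", "C", "B"),
  hgoal   = c(1, 1, 1, 2, 5, 0),
  vgoal   = c(0, 1, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

final <- rank_with_h2h(matches)
final
```

In the toy data A finishes above B despite the worse overall goal difference, which is the behaviour the head-to-head rule is meant to produce.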

  Next, we use the final standings to order the levels of the factors ‘home’ and ‘visitor’ in the ‘df’ dataframe. This enables us to accurately rank order the teams in the final matrix.

df$home <- factor(df$home, levels=rev(temp2$team))
levels(df$home)
##  [1] "Real Betis"         "Real Valladolid"    "CA Osasuna"        
##  [4] "UD Almería"         "Elche CF"           "Granada CF"        
##  [7] "Espanyol Barcelona" "Getafe CF"          "Rayo Vallecano"    
## [10] "Málaga CF"          "Levante UD"         "Celta Vigo"        
## [13] "Valencia CF"        "Real Sociedad"      "Villarreal CF"     
## [16] "Sevilla FC"         "Athletic Bilbao"    "Real Madrid"       
## [19] "FC Barcelona"       "Atlético Madrid"
df$visitor <- factor(df$visitor, levels=temp2$team)
levels(df$visitor)
##  [1] "Atlético Madrid"    "FC Barcelona"       "Real Madrid"       
##  [4] "Athletic Bilbao"    "Sevilla FC"         "Villarreal CF"     
##  [7] "Real Sociedad"      "Valencia CF"        "Celta Vigo"        
## [10] "Levante UD"         "Málaga CF"          "Rayo Vallecano"    
## [13] "Getafe CF"          "Espanyol Barcelona" "Granada CF"        
## [16] "Elche CF"           "UD Almería"         "CA Osasuna"        
## [19] "Real Valladolid"    "Real Betis"
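
To see why setting the levels matters, here is a small base R illustration with three of the teams above. Without explicit levels a factor sorts its levels alphabetically, and ggplot2 places discrete values along an axis in level order.

```r
teams <- c("Athletic Bilbao", "FC Barcelona", "Real Madrid")

# Default: levels are sorted alphabetically.
levels(factor(teams))

# Explicit levels: order taken from the final standings (2nd, 3rd, 4th place).
by_standing <- c("FC Barcelona", "Real Madrid", "Athletic Bilbao")
f <- factor(teams, levels = by_standing)

# The integer codes show each team's position in the chosen ordering.
as.integer(f)   # 3 1 2
```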

Here is the code to make our final matrix:

ggplot(df, aes(home, visitor, fill = factor(result))) + 
  geom_tile(colour="gray20", size=1.5, stat="identity", height=1, width=1) + 
  geom_text(data=df, aes(home, visitor, label = FT), color="black", size=rel(2)) +
  coord_flip() +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  xlab("") + 
  ylab("") +
  theme(
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_rect(fill=NA,color="gray20", size=0.5, linetype="solid"),
        axis.line = element_blank(),
        axis.ticks = element_blank(), 
        axis.text = element_text(color="white", size=rel(1)),
        panel.background = element_rect(fill="gray20"),
        plot.background = element_rect(fill="gray20"),
        legend.position = "none",
        axis.text.x  = element_text(angle=90, vjust=0.5, hjust=0)        
  ) 

This is nice, but it’s not as visually appealing to have the team names at the bottom on the x-axis. I’d prefer them at the top of the matrix. Unfortunately, this doesn’t seem possible in ggplot2 as of writing. See this question on Stack Overflow that I previously asked. If anybody knows how to do it, I’d love to hear.
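
For what it’s worth, ggplot2 releases from 2.2.0 onwards (to my knowledge) accept a position argument on axis scales, which may make hand-editing unnecessary. The sketch below uses a made-up two-team grid rather than the plot above. It maps visitor to x directly instead of using coord_flip(), so that the x-axis can be moved with scale_x_discrete(position = "top").

```r
library(ggplot2)

# Made-up 2x2 example; team names and scores are placeholders.
toy <- data.frame(
  home    = factor(c("A", "B"), levels = c("B", "A")),  # reversed so A is the top row
  visitor = factor(c("B", "A"), levels = c("A", "B")),
  FT      = c("2-1", "0-0"),
  result  = c("H", "D")
)

p <- ggplot(toy, aes(visitor, home, fill = result)) +
  geom_tile(colour = "gray20") +
  geom_text(aes(label = FT)) +
  scale_x_discrete(position = "top") +   # draw the visitor labels above the tiles
  xlab("") +
  ylab("") +
  theme(legend.position = "none")
```

Whether this works depends on the installed ggplot2 version, so treat it as something to try rather than a guaranteed fix.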

For this post, I quickly edited the plot by hand to produce this:

 

I find these kinds of visualizations very informative. I think they’d be even better for looking at partial-season results. For instance, with this type of visualization it’s possible to see whether a team is high in the standings because it hasn’t yet played the top teams.
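
A partial-season version only needs the matches up to some cutoff date. This is a minimal sketch on made-up fixtures. It assumes the date column is kept (the select() call earlier in this post drops it) and that it either is a Date or converts cleanly from "YYYY-MM-DD" text.

```r
# Hypothetical fixtures; only the `date` column matters for this sketch.
toy <- data.frame(
  date    = c("2013-08-17", "2013-12-01", "2014-03-23", "2014-05-17"),
  home    = c("A", "B", "A", "C"),
  visitor = c("B", "C", "C", "A"),
  FT      = c("1-0", "2-2", "0-3", "1-1"),
  stringsAsFactors = FALSE
)

# Convert to Date in case the column is stored as text.
toy$date <- as.Date(toy$date)

# Keep only matches played on or before the cutoff.
cutoff  <- as.Date("2013-12-31")
partial <- toy[toy$date <= cutoff, ]

nrow(partial)   # 2
```

The standings and the matrix can then be built from partial in the same way as for the full season.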

Any questions or comments, please email me.