This is a quick demonstration of the data contained in the spainliga dataset in my engsoccerdata package. I realize now that calling it engsoccerdata was a mistake, as I am now including European soccer data too. Still, I think it’s better to add new datasets to this existing package than to create a new one.
As of writing, the spainliga dataset contains all results (including half-time results) from La Liga, the Spanish top flight, from its inception in the 1928/29 season (all of that season’s games were actually played in 1929) through to the end of the 2013/14 season. It does not include the relegation playoffs held in some seasons, which mostly involved Segunda División teams. It’s important to note that I use the most recent name for every team, i.e. ‘FC Barcelona’ rather than ‘CF Barcelona’ and ‘Real Sociedad’ rather than ‘Donostia’.
First install my engsoccerdata package from GitHub if you haven’t already. Make sure you have the devtools package loaded:
library(devtools)
install_github("jalapic/engsoccerdata")
Now load the required packages. In addition to engsoccerdata, we are also using dplyr to flexibly restructure our data and ggplot2 for visualizing it.
library(engsoccerdata)
library(dplyr)
library(ggplot2)
The dataset we are using is spainliga. We can look at it here:
df <- spainliga
head(df)
## date Season home visitor HT FT hgoal vgoal
## 1 1929-02-10 1928 Arenas de Getxo Atlético Madrid 0-2 2-3 2 3
## 2 1929-02-10 1928 Espanyol Barcelona Real Unión 1-0 3-2 3 2
## 3 1929-02-10 1928 Real Madrid CE Europa 0-0 5-0 5 0
## 4 1929-02-10 1928 Real Sociedad Athletic Bilbao 1-1 1-1 1 1
## 5 1929-02-12 1928 Racing Santander FC Barcelona 0-0 0-2 0 2
## 6 1929-02-17 1928 FC Barcelona Real Madrid 0-1 1-2 1 2
## tier round group notes
## 1 1 league <NA> <NA>
## 2 1 league <NA> <NA>
## 3 1 league <NA> <NA>
## 4 1 league <NA> <NA>
## 5 1 league <NA> <NA>
## 6 1 league <NA> <NA>
Most variables are self-explanatory. The ‘round’ variable indicates whether the game was a regular league game or part of the second phase of matches played in 1986/87, when the league was split into three separate mini-leagues once the 18 teams had completed the regular home-and-away schedule. The ‘group’ variable identifies these mini-leagues. The ‘notes’ variable contains one piece of information, regarding a match in the 1979/80 season, “CD Málaga 0-1 AD Almería”, which was not played but was awarded 0-1 to AD Almería because CD Málaga failed to participate.
table(df$round)
##
## league phase2
## 23065 90
table(df$group)
##
## A B C
## 30 30 30
To look at all-time performances in La Liga, we can summarize the data using dplyr. First we need a separate observation for every team’s game. This means doubling the dataframe, so that each match has two observations (one for each team involved). We can then calculate the goals for (GF) and against (GA) for each team from the appropriate variable of the original match observation, e.g. GF is ‘hgoal’ if the team was the home team and ‘vgoal’ if it was the away team. Goal difference (GD) is then simply GF minus GA.
temp <-
  rbind(
    df %>% select(Season, team=home, opp=visitor, GF=hgoal, GA=vgoal),
    df %>% select(Season, team=visitor, opp=home, GF=vgoal, GA=hgoal)
  ) #rbind two copies of the original df, simply reversing home/away team for each match
temp$GF <- as.numeric(temp$GF) #make sure is numeric
temp$GA <- as.numeric(temp$GA) #make sure is numeric
temp <- temp %>% mutate(GD = GF-GA)
head(temp)
## Season team opp GF GA GD
## 1 1928 Arenas de Getxo Atlético Madrid 2 3 -1
## 2 1928 Espanyol Barcelona Real Unión 3 2 1
## 3 1928 Real Madrid CE Europa 5 0 5
## 4 1928 Real Sociedad Athletic Bilbao 1 1 0
## 5 1928 Racing Santander FC Barcelona 0 2 -2
## 6 1928 FC Barcelona Real Madrid 1 2 -1
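A quick sanity check on the doubled dataframe can catch mistakes in the home/away swap. The sketch below is not part of the original analysis: it retypes the six matches shown in head(df) into a small dataframe (called matches here purely for illustration) and confirms that every match appears twice and that goals for and against balance out.

```r
library(dplyr)

# The six matches printed by head(df) above, retyped by hand for illustration
matches <- data.frame(
  Season  = 1928,
  home    = c("Arenas de Getxo", "Espanyol Barcelona", "Real Madrid",
              "Real Sociedad", "Racing Santander", "FC Barcelona"),
  visitor = c("Atlético Madrid", "Real Unión", "CE Europa",
              "Athletic Bilbao", "FC Barcelona", "Real Madrid"),
  hgoal   = c(2, 3, 5, 1, 0, 1),
  vgoal   = c(3, 2, 0, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Same doubling step as in the post: one row per team per match
long <- rbind(
  matches %>% select(Season, team = home, opp = visitor, GF = hgoal, GA = vgoal),
  matches %>% select(Season, team = visitor, opp = home, GF = vgoal, GA = hgoal)
) %>%
  mutate(GD = GF - GA)

# Every match should now appear exactly twice
stopifnot(nrow(long) == 2 * nrow(matches))

# Each goal is one team's GF and the other's GA, so the totals must agree
stopifnot(sum(long$GF) == sum(long$GA), sum(long$GD) == 0)
```

If either check fails on the real data, the most likely culprit is a column swapped in one of the two select() calls.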
Next, we can get a summary for each team of their total games played in La Liga, total wins, draws and losses, as well as their cumulative goals scored, goals against and goal difference.
temp1 <-
  temp %>% group_by(team) %>%
  summarize(GP = n(),
            goalsF = sum(GF),
            goalsA = sum(GA),
            goaldif = sum(GD),
            W = sum(GD>0),
            D = sum(GD==0),
            L = sum(GD<0)
  )
temp1
## Source: local data frame [59 x 8]
##
## team GP goalsF goalsA goaldif W D L
## 1 AD Almería 68 72 116 -44 17 18 33
## 2 Albacete 270 320 411 -91 76 76 118
## 3 Arenas de Getxo 130 227 308 -81 43 21 66
## 4 Athletic Bilbao 2648 4478 3571 907 1157 609 882
## 5 Atletico Tetuan 30 51 85 -34 7 5 18
## 6 Atlético Madrid 2500 4334 3235 1099 1167 576 757
## 7 Burgos CF 204 216 310 -94 59 50 95
## 8 CA Osasuna 1278 1458 1739 -281 421 316 541
## 9 CD Alavés 342 417 585 -168 111 68 163
## 10 CD Alcoyano 108 145 252 -107 30 16 62
## .. ... ... ... ... ... ... ... ...
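The summary above does not count how many seasons each team has spent in La Liga. One way to add that, sketched here on the assumption that the Season column identifies a season uniquely, is dplyr’s n_distinct(). The small dataframe is retyped from the two head(df) printouts shown in this post (three matches from 1928/29 and three from 2013/14) and is only there to make the example self-contained.

```r
library(dplyr)

# Six real results copied from the head(df) printouts in this post
matches <- data.frame(
  Season  = c(1928, 1928, 1928, 2013, 2013, 2013),
  home    = c("Real Madrid", "Real Sociedad", "FC Barcelona",
              "Real Sociedad", "FC Barcelona", "Real Madrid"),
  visitor = c("CE Europa", "Athletic Bilbao", "Real Madrid",
              "Getafe CF", "Levante UD", "Real Betis"),
  hgoal   = c(5, 1, 1, 2, 7, 2),
  vgoal   = c(0, 1, 2, 0, 0, 1),
  stringsAsFactors = FALSE
)

# One row per team per match, as before
long <- rbind(
  matches %>% select(Season, team = home, GF = hgoal, GA = vgoal),
  matches %>% select(Season, team = visitor, GF = vgoal, GA = hgoal)
)

per_team <- long %>%
  group_by(team) %>%
  summarize(seasons = n_distinct(Season),  # number of different seasons played
            GP      = n(),                 # games played
            goaldif = sum(GF - GA))        # cumulative goal difference

per_team
```

On the full dataset the same summarize() call could be applied to temp directly, since temp keeps the Season column.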
Next I’m just setting up a ggplot theme to make the graph look pretty…
### just setting my theme
mytheme <- theme(
  plot.title = element_text(hjust=0,vjust=1, size=rel(1.7)),
  panel.background = element_blank(),
  panel.grid.major.y = element_line(color="gray65"),
  panel.grid.major.x = element_line(color="gray65"),
  panel.grid.minor = element_blank(),
  plot.background = element_blank(),
  text = element_text(color="gray20", size=10),
  axis.text = element_text(size=rel(1.0)),
  axis.text.x = element_text(color="gray20",size=rel(1.5)),
  axis.text.y = element_text(color="gray20", size=rel(1.5)),
  axis.title.x = element_text(size=rel(1.5), vjust=0),
  axis.title.y = element_text(size=rel(1.5), vjust=1),
  axis.ticks.y = element_blank(),
  axis.ticks.x = element_blank(),
  legend.position = "none"
)
Here is the plot…
ggplot(temp1, aes(GP, goaldif)) +
  geom_point(size=4, color="firebrick1") +
  mytheme +
  xlab("Games played in La Liga") +
  ylab("Cumulative goal difference") +
  ggtitle("All time goal difference in La Liga")
With a little manual editing to add labels to some interesting data points, it looks like this:
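Labels can also be added in code rather than by hand. The sketch below is one possible approach, not how the figure in this post was made: it passes a filtered subset to geom_text() so that only some teams are labelled. The five rows are copied from the temp1 printout above, and the cut-off of 250 goals is an arbitrary value chosen for illustration.

```r
library(dplyr)
library(ggplot2)

# A few rows copied from the temp1 summary printed earlier
teams <- data.frame(
  team    = c("Albacete", "Arenas de Getxo", "Athletic Bilbao",
              "Burgos CF", "CA Osasuna"),
  GP      = c(270, 130, 2648, 204, 1278),
  goaldif = c(-91, -81, 907, -94, -281),
  stringsAsFactors = FALSE
)

# Label only teams whose goal difference is far from zero
# (250 is a hypothetical threshold, not a value from the post)
to_label <- teams %>% filter(abs(goaldif) > 250)

p <- ggplot(teams, aes(GP, goaldif)) +
  geom_point(size = 4, color = "firebrick1") +
  geom_text(data = to_label, aes(label = team), vjust = -1, size = 3) +
  xlab("Games played in La Liga") +
  ylab("Cumulative goal difference")

p
```

With many labelled points the text will overlap, so some hand-tuning of positions may still be needed.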
I thought that it would be interesting to make a results matrix for a particular season. Most results matrices are ordered in alphabetical order, but I thought it’d be better to order them by final league standings. Visually, this should give a better indicator of team strength.
I managed to make this work, but with a bit more manual editing than I typically like.
First of all let’s get the data for La Liga in 2013/14 and make a result column that depicts home wins, away wins and draws.
df <- df %>% filter(Season==2013)
df <- df %>%
  select(home,visitor,FT,hgoal,vgoal) %>%
  mutate(GD=hgoal-vgoal,
         result = ifelse(GD>0, "H", ifelse(GD<0, "A", "D"))
  )
head(df)
## home visitor FT hgoal vgoal GD result
## 1 Real Sociedad Getafe CF 2-0 2 0 2 H
## 2 Real Valladolid Athletic Bilbao 1-2 1 2 -1 A
## 3 Valencia CF Málaga CF 1-0 1 0 1 H
## 4 FC Barcelona Levante UD 7-0 7 0 7 H
## 5 Real Madrid Real Betis 2-1 2 1 1 H
## 6 CA Osasuna Granada CF 1-2 1 2 -1 A
Next, we make the final table. I am sorting this table by total points. The tie-breaker for teams on equal points is actually head-to-head performance. For this quick worked example, I am just going to manually enter league positions for those teams that are tied on points. It wouldn’t be too hard to write a function to automatically work out the head-to-head deciders.
temp <-
  rbind(
    df %>% select(team=home, opp=visitor, GF=hgoal, GA=vgoal),
    df %>% select(team=visitor, opp=home, GF=vgoal, GA=hgoal)
  ) #rbind two copies of the original df, simply reversing home/away team for each match
temp1 <-
  temp %>%
  mutate(GD = GF-GA) %>%
  group_by(team) %>%
  summarize(GP = n(),
            gf = sum(GF),
            ga = sum(GA),
            gd = sum(GD),
            W = sum(GD>0),
            D = sum(GD==0),
            L = sum(GD<0)
  ) %>%
  mutate(Pts = (W*3) + D) %>%
  arrange(desc(Pts))
temp1 <- temp1 %>% mutate(pos = rank(desc(Pts)))
temp1
## Source: local data frame [20 x 10]
##
## team GP gf ga gd W D L Pts pos
## 1 Atlético Madrid 38 77 26 51 28 6 4 90 1.0
## 2 FC Barcelona 38 100 33 67 27 6 5 87 2.5
## 3 Real Madrid 38 104 38 66 27 6 5 87 2.5
## 4 Athletic Bilbao 38 66 39 27 20 10 8 70 4.0
## 5 Sevilla FC 38 69 52 17 18 9 11 63 5.0
## 6 Real Sociedad 38 62 55 7 16 11 11 59 6.5
## 7 Villarreal CF 38 60 44 16 17 8 13 59 6.5
## 8 Celta Vigo 38 49 54 -5 14 7 17 49 8.5
## 9 Valencia CF 38 51 53 -2 13 10 15 49 8.5
## 10 Levante UD 38 35 43 -8 12 12 14 48 10.0
## 11 Málaga CF 38 39 46 -7 12 9 17 45 11.0
## 12 Rayo Vallecano 38 46 80 -34 13 4 21 43 12.0
## 13 Espanyol Barcelona 38 41 51 -10 11 9 18 42 13.5
## 14 Getafe CF 38 35 54 -19 11 9 18 42 13.5
## 15 Granada CF 38 32 56 -24 12 5 21 41 15.0
## 16 Elche CF 38 30 50 -20 9 13 16 40 16.5
## 17 UD Almería 38 43 71 -28 11 7 20 40 16.5
## 18 CA Osasuna 38 32 62 -30 10 9 19 39 18.0
## 19 Real Valladolid 38 38 60 -22 7 15 16 36 19.0
## 20 Real Betis 38 36 78 -42 6 7 25 25 20.0
temp2 <- temp1 %>% select(team,pos)
#manually edit tied teams - referred to a published final standings table
temp2[2,2] <- 2
temp2[3,2] <- 3
temp2[7,2] <- 6
temp2[6,2] <- 7
temp2[9,2] <- 8
temp2[8,2] <- 9
temp2[14,2] <- 13
temp2[13,2] <- 14
temp2[16,2] <- 16
temp2[17,2] <- 17
temp2 <- temp2 %>% arrange(pos)
temp2
## Source: local data frame [20 x 2]
##
## team pos
## 1 Atlético Madrid 1
## 2 FC Barcelona 2
## 3 Real Madrid 3
## 4 Athletic Bilbao 4
## 5 Sevilla FC 5
## 6 Villarreal CF 6
## 7 Real Sociedad 7
## 8 Valencia CF 8
## 9 Celta Vigo 9
## 10 Levante UD 10
## 11 Málaga CF 11
## 12 Rayo Vallecano 12
## 13 Getafe CF 13
## 14 Espanyol Barcelona 14
## 15 Granada CF 15
## 16 Elche CF 16
## 17 UD Almería 17
## 18 CA Osasuna 18
## 19 Real Valladolid 19
## 20 Real Betis 20
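The manual position edits above could in principle be replaced by a function. The following is a rough sketch only, under a simplified tie-break order that I am assuming for illustration: total points, then points in matches between teams level on points, then goal difference in those matches, then overall goal difference. The function names to_long() and rank_with_h2h() are hypothetical, and the three-team league is made up so the example can run on its own.

```r
library(dplyr)

# Made-up results for three hypothetical teams (double round-robin)
matches <- data.frame(
  home    = c("Team A", "Team B", "Team A", "Team C", "Team B", "Team C"),
  visitor = c("Team B", "Team A", "Team C", "Team A", "Team C", "Team B"),
  hgoal   = c(0, 1, 5, 0, 1, 1),
  vgoal   = c(1, 1, 0, 3, 0, 0),
  stringsAsFactors = FALSE
)

# One row per team per match, with goal difference and points earned
to_long <- function(m) {
  bind_rows(
    m %>% transmute(team = home, opp = visitor, GF = hgoal, GA = vgoal),
    m %>% transmute(team = visitor, opp = home, GF = vgoal, GA = hgoal)
  ) %>%
    mutate(GD = GF - GA, Pts = 3 * (GD > 0) + (GD == 0))
}

rank_with_h2h <- function(m) {
  long <- to_long(m)
  overall <- long %>%
    group_by(team) %>%
    summarize(Pts = sum(Pts), gd = sum(GD))
  pts_lookup <- setNames(overall$Pts, overall$team)
  # Head-to-head record: only matches between teams level on total points
  h2h <- long %>%
    filter(unname(pts_lookup[team]) == unname(pts_lookup[opp])) %>%
    group_by(team) %>%
    summarize(h2h_pts = sum(Pts), h2h_gd = sum(GD))
  overall %>%
    left_join(h2h, by = "team") %>%
    # Teams not tied with anyone have no head-to-head rows; treat as zero
    mutate(h2h_pts = coalesce(h2h_pts, 0), h2h_gd = coalesce(h2h_gd, 0)) %>%
    arrange(desc(Pts), desc(h2h_pts), desc(h2h_gd), desc(gd)) %>%
    mutate(pos = row_number())
}

standings <- rank_with_h2h(matches)
standings
```

In this made-up league Team A and Team B finish level on points and Team A has the better overall goal difference, but Team B took more points from their two meetings and so is ranked first.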
Next, we use the final standings to order the levels of the factors ‘home’ and ‘visitor’ in the ‘df’ dataframe. This enables us to accurately rank order the teams in the final matrix.
df$home <- factor(df$home, levels=rev(temp2$team))
levels(df$home)
## [1] "Real Betis" "Real Valladolid" "CA Osasuna"
## [4] "UD Almería" "Elche CF" "Granada CF"
## [7] "Espanyol Barcelona" "Getafe CF" "Rayo Vallecano"
## [10] "Málaga CF" "Levante UD" "Celta Vigo"
## [13] "Valencia CF" "Real Sociedad" "Villarreal CF"
## [16] "Sevilla FC" "Athletic Bilbao" "Real Madrid"
## [19] "FC Barcelona" "Atlético Madrid"
df$visitor <- factor(df$visitor, levels=temp2$team)
levels(df$visitor)
## [1] "Atlético Madrid" "FC Barcelona" "Real Madrid"
## [4] "Athletic Bilbao" "Sevilla FC" "Villarreal CF"
## [7] "Real Sociedad" "Valencia CF" "Celta Vigo"
## [10] "Levante UD" "Málaga CF" "Rayo Vallecano"
## [13] "Getafe CF" "Espanyol Barcelona" "Granada CF"
## [16] "Elche CF" "UD Almería" "CA Osasuna"
## [19] "Real Valladolid" "Real Betis"
Here is the code to make our final matrix:
ggplot(df, aes(home, visitor, fill = factor(result))) +
  geom_tile(colour="gray20", size=1.5, stat="identity", height=1, width=1) +
  geom_text(data=df, aes(home, visitor, label = FT), color="black", size=rel(2)) +
  coord_flip() +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  xlab("") +
  ylab("") +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(fill=NA,color="gray20", size=0.5, linetype="solid"),
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    axis.text = element_text(color="white", size=rel(1)),
    panel.background = element_rect(fill="gray20"),
    plot.background = element_rect(fill="gray20"),
    legend.position = "none",
    axis.text.x = element_text(angle=90, vjust=0.5, hjust=0)
  )
This is nice, but it’s not as visually appealing to have the team names at the bottom on the x-axis. I’d prefer them at the top of the matrix. Unfortunately this doesn’t seem possible in the version of ggplot2 I am using. See this question on Stack Overflow that I previously asked. If anybody knows how to do it, I’d love to hear.
Therefore, I quickly edited the image by hand to produce this:
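More recent ggplot2 releases (2.2.0 onwards, as far as I am aware) accept a position argument on axis scales, which offers one way to get the team names above the matrix without manual editing. The sketch below is an alternative under my own assumptions rather than the method used for the figure: it drops coord_flip(), maps visitor straight to the x-axis, and uses two results copied from the 2013/14 head(df) output as stand-in data.

```r
library(ggplot2)

# Two 2013/14 results copied from the head(df) output above
res <- data.frame(
  home    = c("FC Barcelona", "Real Madrid"),
  visitor = c("Levante UD", "Real Betis"),
  FT      = c("7-0", "2-1"),
  result  = c("H", "H"),
  stringsAsFactors = FALSE
)

p <- ggplot(res, aes(x = visitor, y = home, fill = result)) +
  geom_tile(colour = "gray20") +
  geom_text(aes(label = FT), color = "black") +
  # position = "top" draws the x-axis labels above the panel
  scale_x_discrete(position = "top", expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  xlab("") +
  ylab("") +
  theme(
    # the top axis has its own theme element for label styling
    axis.text.x.top = element_text(angle = 90, vjust = 0.5, hjust = 0),
    legend.position = "none"
  )

p
```

For the full matrix, the factor levels of home and visitor would still need to be set from the final standings as shown earlier.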
I find these kinds of visualizations very informative. I think they’d also be great for looking at partial-season results. For instance, with this type of visualization it’s possible to see whether a team is high in the standings because it hasn’t yet played the top teams.
Any questions or comments, please email me.