jc3181 AT columbia DOT edu

 

In this guide I’ll show you how to create simple charts that track the cumulative goals scored and conceded and points gained over the course of different seasons. We can do this very easily using my R package engsoccerdata that contains the date and result of every league game ever played.

 


Getting started

First install my engsoccerdata package from GitHub if you haven’t already. Make sure you have the devtools package loaded:

library(devtools)
install_github("jalapic/engsoccerdata")

 

Now load the required packages. In addition to engsoccerdata, we are also using dplyr to flexibly restructure our data and ggplot2 for visualizing it. We shall also be using XML to do some web-scraping for 2014/15 data. tidyr is used for its separate function.

library(engsoccerdata)
library(dplyr)
library(ggplot2)
library(XML)
library(tidyr)

 

Restructuring data

The dataset to use is engsoccerdata2 - this contains all league results up to the end of the 2013/14 season. Each season in this dataset is referred to by the year in which it began, i.e. 2013 refers to the 2013/14 season. This is what it looks like after adding an extra variable in the appropriate date format:

df <- engsoccerdata2
df$date <- as.Date(df$Date, format="%Y-%m-%d")
tail(df)
##              Date Season      home           visitor  FT hgoal vgoal
## 188055 2013-09-28   2013 York City        Portsmouth 4-2     4     2
## 188056 2013-11-30   2013 York City          Rochdale 0-0     0     0
## 188057 2013-10-29   2013 York City Scunthorpe United 4-1     4     1
## 188058 2014-02-22   2013 York City   Southend United 0-0     0     0
## 188059 2014-03-25   2013 York City    Torquay United 1-0     1     0
## 188060 2014-03-15   2013 York City Wycombe Wanderers 2-0     2     0
##        division tier totgoal goaldif result       date
## 188055        4    4       6       2      H 2013-09-28
## 188056        4    4       0       0      D 2013-11-30
## 188057        4    4       5       3      H 2013-10-29
## 188058        4    4       0       0      D 2014-02-22
## 188059        4    4       1       1      H 2014-03-25
## 188060        4    4       2       2      H 2014-03-15

 

The first thing to do is filter the data to only include the team we’re interested in looking at. The following code shows the required spelling of all teams in the dataset alphabetically:

sort(unique(engsoccerdata2$home))
##   [1] "Aberdare Athletic"        "Accrington"              
##   [3] "Accrington F.C."          "Accrington Stanley"      
##   [5] "AFC Bournemouth"          "AFC Wimbledon"           
##   [7] "Aldershot"                "Arsenal"                 
##   [9] "Ashington"                "Aston Villa"             
##  [11] "Barnet"                   "Barnsley"                
##  [13] "Barrow"                   "Birmingham City"         
##  [15] "Blackburn Rovers"         "Blackpool"               
##  [17] "Bolton Wanderers"         "Bootle"                  
##  [19] "Boston United"            "Bradford City"           
##  [21] "Bradford Park Avenue"     "Brentford"               
##  [23] "Brighton & Hove Albion"   "Bristol City"            
##  [25] "Bristol Rovers"           "Burnley"                 
##  [27] "Burton Albion"            "Burton Swifts"           
##  [29] "Burton United"            "Burton Wanderers"        
##  [31] "Bury"                     "Cambridge United"        
##  [33] "Cardiff City"             "Carlisle United"         
##  [35] "Charlton Athletic"        "Chelsea"                 
##  [37] "Cheltenham"               "Chester"                 
##  [39] "Chesterfield"             "Colchester United"       
##  [41] "Coventry City"            "Crawley Town"            
##  [43] "Crewe Alexandra"          "Crystal Palace"          
##  [45] "Dagenham and Redbridge"   "Darlington"              
##  [47] "Darwen"                   "Derby County"            
##  [49] "Doncaster Rovers"         "Durham City"             
##  [51] "Everton"                  "Exeter City"             
##  [53] "Fleetwood Town"           "Fulham"                  
##  [55] "Gainsborough Trinity"     "Gateshead"               
##  [57] "Gillingham"               "Glossop North End"       
##  [59] "Grimsby Town"             "Halifax Town"            
##  [61] "Hartlepool United"        "Hereford United"         
##  [63] "Huddersfield Town"        "Hull City"               
##  [65] "Ipswich Town"             "Kidderminster Harriers"  
##  [67] "Leeds City"               "Leeds United"            
##  [69] "Leicester City"           "Leyton Orient"           
##  [71] "Lincoln City"             "Liverpool"               
##  [73] "Loughborough"             "Luton Town"              
##  [75] "Macclesfield"             "Maidstone United"        
##  [77] "Manchester City"          "Manchester United"       
##  [79] "Mansfield Town"           "Merthyr Town"            
##  [81] "Middlesbrough"            "Middlesbrough Ironopolis"
##  [83] "Millwall"                 "Milton Keynes Dons"      
##  [85] "Morecambe"                "Nelson"                  
##  [87] "New Brighton"             "New Brighton Tower"      
##  [89] "Newcastle United"         "Newport County"          
##  [91] "Northampton Town"         "Northwich Victoria"      
##  [93] "Norwich City"             "Nottingham Forest"       
##  [95] "Notts County"             "Oldham Athletic"         
##  [97] "Oxford United"            "Peterborough United"     
##  [99] "Plymouth Argyle"          "Port Vale"               
## [101] "Portsmouth"               "Preston North End"       
## [103] "Queens Park Rangers"      "Reading"                 
## [105] "Rochdale"                 "Rotherham County"        
## [107] "Rotherham Town"           "Rotherham United"        
## [109] "Rushden & Diamonds"       "Scarborough"             
## [111] "Scunthorpe United"        "Sheffield United"        
## [113] "Sheffield Wednesday"      "Shrewsbury Town"         
## [115] "South Shields"            "Southampton"             
## [117] "Southend United"          "Southport"               
## [119] "Stalybridge Celtic"       "Stevenage Borough"       
## [121] "Stockport County"         "Stoke City"              
## [123] "Sunderland"               "Swansea City"            
## [125] "Swindon Town"             "Thames"                  
## [127] "Torquay United"           "Tottenham Hotspur"       
## [129] "Tranmere Rovers"          "Walsall"                 
## [131] "Watford"                  "West Bromwich Albion"    
## [133] "West Ham United"          "Wigan Athletic"          
## [135] "Wigan Borough"            "Wimbledon"               
## [137] "Wolverhampton Wanderers"  "Workington"              
## [139] "Wrexham"                  "Wycombe Wanderers"       
## [141] "Yeovil"                   "York City"

 

Filtering data

Let’s say we want to know about how many goals Liverpool have cumulatively scored in Premier League games within each season. (The first season of the EPL was 1992/93):

df <- df %>% filter(home=="Liverpool" | visitor=="Liverpool")  %>% filter(tier==1 & Season>=1992)

Next, split every season into its own dataframe - the result is stored as a list.

# Split by Season
myseasons <- split(df,df$Season)
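
To see what the filter-then-split step produces, here is a minimal sketch on a small made-up results table. The `toy` data frame and its rows are invented for illustration and are not taken from engsoccerdata2:

```r
library(dplyr)

# Toy results table: invented rows that mimic the columns used above
toy <- data.frame(
  Season  = c(1992, 1992, 1993, 1993, 1993),
  home    = c("Liverpool", "Arsenal", "Liverpool", "Everton", "Chelsea"),
  visitor = c("Arsenal", "Liverpool", "Everton", "Chelsea", "Liverpool"),
  tier    = c(1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Keep only games involving Liverpool in the top tier from 1992 onwards
toy_lfc <- toy %>%
  filter(home == "Liverpool" | visitor == "Liverpool") %>%
  filter(tier == 1 & Season >= 1992)

# One data frame per season, stored in a list named by season
toy_seasons <- split(toy_lfc, toy_lfc$Season)
names(toy_seasons)
sapply(toy_seasons, nrow)
```

The list names are the season labels as character strings, which is what lets us index a season by position or by name.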

 

These are the first few rows of the first season in our filtered data. Note that the data are not sorted by date initially but in alphabetical order of the home team:

head(myseasons[[1]])
##         Date Season             home   visitor  FT hgoal vgoal division
## 1 1993-01-31   1992          Arsenal Liverpool 0-1     0     1        1
## 2 1992-09-19   1992      Aston Villa Liverpool 4-2     4     2        1
## 3 1993-04-03   1992 Blackburn Rovers Liverpool 4-1     4     1        1
## 4 1993-02-10   1992          Chelsea Liverpool 0-0     0     0        1
## 5 1992-12-19   1992    Coventry City Liverpool 5-1     5     1        1
## 6 1993-03-23   1992   Crystal Palace Liverpool 1-1     1     1        1
##   tier totgoal goaldif result       date
## 1    1       1      -1      A 1993-01-31
## 2    1       6       2      H 1992-09-19
## 3    1       5       3      H 1993-04-03
## 4    1       0       0      D 1993-02-10
## 5    1       6       4      H 1992-12-19
## 6    1       2       0      D 1993-03-23

 

Getting cumulative data

What we need to do now is write a function that we can apply to every dataframe in the list myseasons. This function will turn the raw data into cumulative goals scored, goals conceded, goal difference and points. We could also look at cumulative home wins, defeats, away draws etc., but I won’t do that here.

In brief, the function uses dplyr to separate the home and away results and to create the following variables for each: team name, opponent, goals for, goals against, goal difference, result (win, draw, loss) and venue (home or away). These two dataframes are then combined into a ‘dfboth’ dataframe, which is filtered down to the team of interest. We can then number the games within a season using dense_rank from dplyr and calculate all the cumulative variables using the base-R cumsum function.

getdatafun1 <- function(df, teamname){
  
  # Every game from the home team's point of view
  dfhome <- df %>% mutate(GF=as.numeric(as.character(hgoal)),
                          GA=as.numeric(as.character(vgoal)),
                          GD = GF-GA,
                          result=ifelse(GD>0, "W", ifelse(GD<0, "L", "D")),
                          venue="home") %>%
    select(date, team=home, opponent=visitor, GF,GA,GD,result,venue)
  
  # Every game from the away team's point of view
  dfaway <- df %>% mutate(GF=as.numeric(as.character(vgoal)),
                          GA=as.numeric(as.character(hgoal)),
                          GD = GF-GA,
                          result=ifelse(GD>0, "W", ifelse(GD<0, "L", "D")),
                          venue="away") %>%
    select(date, team=visitor, opponent=home, GF,GA,GD,result,venue)
  
  # Stack both views so each game appears once per team
  dfboth<-rbind(dfhome,dfaway)
  
  # Keep the chosen team, number its games by date, then accumulate
  dfboth <- dfboth %>% 
    filter(team==teamname) %>%
    mutate(Gameno = dense_rank(date)) %>% 
    arrange(Gameno) %>% 
    mutate(Pts = ifelse(result=="W", 3, ifelse(result=="D", 1, 0))) %>% 
    mutate(Cumpts = cumsum(Pts),
           CumGF = cumsum(GF),
           CumGA = cumsum(GA),
           CumGD = cumsum(GD)) %>%
    select(Gameno, Cumpts, CumGF, CumGA, CumGD)
  return(dfboth)
}
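
The core of the function is the ranking and accumulating step. A minimal sketch of just that step, on four invented games entered deliberately out of date order (the `toy` data frame is hypothetical), might look like this:

```r
library(dplyr)

# Invented goals for/against in four games, not in date order
toy <- data.frame(
  date = as.Date(c("2014-08-25", "2014-08-17", "2014-09-13", "2014-08-31")),
  GF   = c(1, 2, 0, 3),
  GA   = c(1, 0, 2, 3)
)

toy_cum <- toy %>%
  mutate(GD     = GF - GA,
         Pts    = ifelse(GD > 0, 3, ifelse(GD == 0, 1, 0)),  # 3 for a win, 1 for a draw
         Gameno = dense_rank(date)) %>%                       # 1 = earliest game
  arrange(Gameno) %>%                                         # sort before accumulating
  mutate(Cumpts = cumsum(Pts),
         CumGF  = cumsum(GF),
         CumGA  = cumsum(GA),
         CumGD  = cumsum(GD))
toy_cum
```

The arrange step matters: cumsum accumulates in row order, so the games have to be in date order before the running totals are taken.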

 

As an example, here is the function applied to Liverpool’s 1992/93 season (the first EPL season). Note that we are required to name the team we are interested in again - the function uses this 2nd argument to recognize what to filter from the inputted dataframe:

getdatafun1(myseasons[[1]], "Liverpool") 
##    Gameno Cumpts CumGF CumGA CumGD
## 1       1      0     0     1    -1
## 2       2      3     2     2     0
## 3       3      3     2     4    -2
## 4       4      4     4     6    -2
## 5       5      5     6     8    -2
## 6       6      6     7     9    -2
## 7       7      9     9    10    -1
## 8       8      9     9    11    -2
## 9       9      9    11    15    -4
## 10     10      9    13    18    -5
## 11     11     12    14    18    -4
## 12     12     13    16    20    -4
## 13     13     16    20    21    -1
## 14     14     16    20    23    -3
## 15     15     19    24    24     0
## 16     16     22    25    24     1
## 17     17     25    30    24     6
## 18     18     25    31    26     5
## 19     19     28    33    27     6
## 20     20     28    34    32     2
## 21     21     29    35    33     2
## 22     22     29    36    35     1
## 23     23     29    36    37    -1
## 24     24     32    37    37     0
## 25     25     33    37    37     0
## 26     26     34    37    37     0
## 27     27     34    38    39    -1
## 28     28     35    38    39    -1
## 29     29     36    39    40    -1
## 30     30     36    40    42    -2
## 31     31     39    41    42    -1
## 32     32     42    43    43     0
## 33     33     45    44    43     1
## 34     34     46    45    44     1
## 35     35     46    46    48    -2
## 36     36     49    47    48    -1
## 37     37     50    48    49    -1
## 38     38     53    52    49     3
## 39     39     56    54    49     5
## 40     40     56    54    50     4
## 41     41     56    56    53     3
## 42     42     59    62    55     7

 

OK - apply this function to all dataframes in myseasons using lapply, then convert to one long dataframe and add in a new variable (id) that refers to the season the data comes from:

x <- lapply(myseasons, getdatafun1, teamname="Liverpool")
alldata <- do.call("rbind", x)
alldata$id <- rep(names(x), sapply(x, nrow))
head(alldata, 10)
##         Gameno Cumpts CumGF CumGA CumGD   id
## 1992.1       1      0     0     1    -1 1992
## 1992.2       2      3     2     2     0 1992
## 1992.3       3      3     2     4    -2 1992
## 1992.4       4      4     4     6    -2 1992
## 1992.5       5      5     6     8    -2 1992
## 1992.6       6      6     7     9    -2 1992
## 1992.7       7      9     9    10    -1 1992
## 1992.8       8      9     9    11    -2 1992
## 1992.9       9      9    11    15    -4 1992
## 1992.10     10      9    13    18    -5 1992
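
The rep/sapply line is the least obvious part of this step. Here is a small base-R sketch with two invented per-season tables (`x_toy` is hypothetical) showing how each season label is repeated once per row of its table:

```r
# Two invented per-season tables, as lapply() might return them
x_toy <- list(
  "1992" = data.frame(Gameno = 1:3, Cumpts = c(0, 3, 3)),
  "1993" = data.frame(Gameno = 1:2, Cumpts = c(3, 6))
)

# Stack them into one long data frame
all_toy <- do.call("rbind", x_toy)

# Repeat each list name once per row of the corresponding table
all_toy$id <- rep(names(x_toy), sapply(x_toy, nrow))
all_toy
```

Because the seasons can have different numbers of games, the second argument to rep has to be a vector of row counts rather than a single number.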

 

Get the 2014/15 data

My engsoccerdata package contains data up to the end of the 2013/14 season. I’ve decided to update it at the end of each season, so we will have to get the 2014/15 data from elsewhere. One option would be to enter the data manually, though I am generally against doing anything manually when it comes to data entry.

A more convenient approach is to scrape the data from ESPNFC. This webpage contains a list of all of this season’s Liverpool EPL games. Just change the filter options and use the appropriate adjusted web link if looking at another team/division.

I am using the readHTMLTable function from the XML library. The result is stored as a list of 3 elements, the 2nd of which contains the data. Looking at the variable names and cross-referencing with the web page, we only need to keep variables 1 (date), 5 (Home), 6 (not named, but contains the FT scoreline) and 7 (Away):

doc <- readHTMLTable("http://www.espn.co.uk/football/sport/match/index.html?event=3;team=305;type=results")
colnames(doc[[2]])
##  [1] "Date"    "BST/GMT" "TV"      ""        "Home"    ""        "Away"   
##  [8] "HT"      "Attend." ""
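
Web pages change or disappear, so it can help to try the same calls on HTML you control first. The sketch below parses a tiny hand-written page from a string; the `html` string is made up for illustration and is much simpler than the real ESPN page:

```r
library(XML)

# A tiny hand-written HTML page standing in for a scraped fixtures page
html <- paste0(
  "<html><body><table>",
  "<tr><th>Date</th><th>Home</th><th>Score</th><th>Away</th></tr>",
  "<tr><td>Sun 17</td><td>Team A</td><td>2-1</td><td>Team B</td></tr>",
  "<tr><td>Mon 25</td><td>Team C</td><td>3-1</td><td>Team A</td></tr>",
  "</table></body></html>"
)

# Parse the string (asText = TRUE), then read every <table> into a data frame
page <- htmlParse(html, asText = TRUE)
tabs <- readHTMLTable(page, stringsAsFactors = FALSE)

length(tabs)          # how many tables were found
colnames(tabs[[1]])   # inspect the headers before choosing columns
```

Checking length and colnames like this is how you find which list element and which columns hold the data you want.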

 

I rename the variables - note this also contains games yet to be played:

temp <- doc[[2]][c(1,5:7)]
colnames(temp)<-c("date", "home", "FT", "visitor")
temp
##              date                 home   FT              visitor
## 1     August 2014                 <NA> <NA>                 <NA>
## 2          Sun 17            Liverpool  2-1          Southampton
## 3          Mon 25      Manchester City  3-1            Liverpool
## 4          Sun 31    Tottenham Hotspur  0-3            Liverpool
## 5  September 2014                 <NA> <NA>                 <NA>
## 6          Sat 13            Liverpool  0-1          Aston Villa
## 7          Sat 20      West Ham United  3-1            Liverpool
## 8          Sat 27            Liverpool  1-1              Everton
## 9    October 2014                 <NA> <NA>                 <NA>
## 10          Sat 4            Liverpool  2-1 West Bromwich Albion
## 11         Sun 19  Queens Park Rangers  2-3            Liverpool
## 12         Sat 25            Liverpool  0-0            Hull City
## 13  November 2014                 <NA> <NA>                 <NA>
## 14          Sat 1     Newcastle United  1-0            Liverpool
## 15          Sat 8            Liverpool  1-2              Chelsea
## 16         Sun 23       Crystal Palace  3-1            Liverpool
## 17         Sat 29            Liverpool  1-0           Stoke City
## 18  December 2014                 <NA> <NA>                 <NA>
## 19          Tue 2       Leicester City  1-3            Liverpool
## 20          Sat 6            Liverpool  0-0           Sunderland
## 21         Sun 14    Manchester United  3-0            Liverpool
## 22         Sun 21            Liverpool  2-2              Arsenal
## 23         Fri 26              Burnley  0-1            Liverpool
## 24         Mon 29            Liverpool  4-1         Swansea City
## 25   January 2015                 <NA> <NA>                 <NA>
## 26          Thu 1            Liverpool  2-2       Leicester City
## 27         Sat 10           Sunderland    v            Liverpool
## 28         Sat 17          Aston Villa    v            Liverpool
## 29         Sat 31            Liverpool    v      West Ham United
## 30  February 2015                 <NA> <NA>                 <NA>
## 31          Sat 7              Everton    v            Liverpool
## 32         Tue 10            Liverpool    v    Tottenham Hotspur
## 33         Sun 22          Southampton    v            Liverpool
## 34     March 2015                 <NA> <NA>                 <NA>
## 35          Sun 1            Liverpool    v      Manchester City
## 36          Wed 4            Liverpool    v              Burnley
## 37         Sat 14         Swansea City    v            Liverpool
## 38         Sat 21            Liverpool    v    Manchester United
## 39     April 2015                 <NA> <NA>                 <NA>
## 40          Sat 4              Arsenal    v            Liverpool
## 41         Sat 11            Liverpool    v     Newcastle United
## 42         Sat 18            Hull City    v            Liverpool
## 43         Sat 25 West Bromwich Albion    v            Liverpool
## 44       May 2015                 <NA> <NA>                 <NA>
## 45          Sat 2            Liverpool    v  Queens Park Rangers
## 46          Sat 9              Chelsea    v            Liverpool
## 47         Sat 16            Liverpool    v       Crystal Palace
## 48         Sun 24           Stoke City    v            Liverpool

 

You will notice that the date variable is not yet nice and neat. For our purposes here, we could safely just delete the blank rows where each new month is listed, as the results are already in date order. However, just for completeness’ sake, I shall create a proper date variable and sort by this.

My strategy here is to create an empty vector of the original date variable length and to add in the month/years to that empty vector at the position that they occur in. This essentially means only copying the date variable when there is a “NA” in any of the other three variables.

tempx <- ifelse(is.na(temp$home), as.character(temp$date), NA)  #as.character() because 'date' is a factor
tempx
##  [1] "August 2014"    NA               NA               NA              
##  [5] "September 2014" NA               NA               NA              
##  [9] "October 2014"   NA               NA               NA              
## [13] "November 2014"  NA               NA               NA              
## [17] NA               "December 2014"  NA               NA              
## [21] NA               NA               NA               NA              
## [25] "January 2015"   NA               NA               NA              
## [29] NA               "February 2015"  NA               NA              
## [33] NA               "March 2015"     NA               NA              
## [37] NA               NA               "April 2015"     NA              
## [41] NA               NA               NA               "May 2015"      
## [45] NA               NA               NA               NA

Now I fill down the NAs in the new ‘tempx’ vector using na.locf from the zoo package. This carries each character element forward through the NA elements until it reaches a new character element. It is in my top 5 favorite R functions! A quick cautionary note: loading zoo masks the base-R as.Date function, which we will need later, so I am not loading the zoo library but calling the function directly from it using :: :

tempx <- zoo::na.locf(tempx)
tempx
##  [1] "August 2014"    "August 2014"    "August 2014"    "August 2014"   
##  [5] "September 2014" "September 2014" "September 2014" "September 2014"
##  [9] "October 2014"   "October 2014"   "October 2014"   "October 2014"  
## [13] "November 2014"  "November 2014"  "November 2014"  "November 2014" 
## [17] "November 2014"  "December 2014"  "December 2014"  "December 2014" 
## [21] "December 2014"  "December 2014"  "December 2014"  "December 2014" 
## [25] "January 2015"   "January 2015"   "January 2015"   "January 2015"  
## [29] "January 2015"   "February 2015"  "February 2015"  "February 2015" 
## [33] "February 2015"  "March 2015"     "March 2015"     "March 2015"    
## [37] "March 2015"     "March 2015"     "April 2015"     "April 2015"    
## [41] "April 2015"     "April 2015"     "April 2015"     "May 2015"      
## [45] "May 2015"       "May 2015"       "May 2015"       "May 2015"
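
If you are curious what the fill-down is doing, it can be sketched in base R with cumsum. This is only an illustration of the idea, not how zoo implements na.locf, and it assumes the first element is not NA (the vector `x` is invented):

```r
# Invented vector with month headers followed by gaps
x <- c("August 2014", NA, NA, "September 2014", NA, "October 2014", NA, NA)

# Count how many non-NA values have been seen so far at each position
idx <- cumsum(!is.na(x))

# Use that running count to index into the non-NA values
filled <- x[!is.na(x)][idx]
filled
```

Each NA picks up the most recent non-NA value before it, which is the same result the zoo::na.locf call gives on the vector above.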

 

Now, we paste together the contents of the scraped ‘date’ variable with this new vector to get each game’s actual date:

paste(temp$date, tempx, sep=" ")
##  [1] "August 2014 August 2014"       "Sun 17 August 2014"           
##  [3] "Mon 25 August 2014"            "Sun 31 August 2014"           
##  [5] "September 2014 September 2014" "Sat 13 September 2014"        
##  [7] "Sat 20 September 2014"         "Sat 27 September 2014"        
##  [9] "October 2014 October 2014"     "Sat 4 October 2014"           
## [11] "Sun 19 October 2014"           "Sat 25 October 2014"          
## [13] "November 2014 November 2014"   "Sat 1 November 2014"          
## [15] "Sat 8 November 2014"           "Sun 23 November 2014"         
## [17] "Sat 29 November 2014"          "December 2014 December 2014"  
## [19] "Tue 2 December 2014"           "Sat 6 December 2014"          
## [21] "Sun 14 December 2014"          "Sun 21 December 2014"         
## [23] "Fri 26 December 2014"          "Mon 29 December 2014"         
## [25] "January 2015 January 2015"     "Thu 1 January 2015"           
## [27] "Sat 10 January 2015"           "Sat 17 January 2015"          
## [29] "Sat 31 January 2015"           "February 2015 February 2015"  
## [31] "Sat 7 February 2015"           "Tue 10 February 2015"         
## [33] "Sun 22 February 2015"          "March 2015 March 2015"        
## [35] "Sun 1 March 2015"              "Wed 4 March 2015"             
## [37] "Sat 14 March 2015"             "Sat 21 March 2015"            
## [39] "April 2015 April 2015"         "Sat 4 April 2015"             
## [41] "Sat 11 April 2015"             "Sat 18 April 2015"            
## [43] "Sat 25 April 2015"             "May 2015 May 2015"            
## [45] "Sat 2 May 2015"                "Sat 9 May 2015"               
## [47] "Sat 16 May 2015"               "Sun 24 May 2015"

 

Next, we add this to the dataframe as a new variable and remove incomplete rows (i.e. those with NAs that originally only contained month/year information):

temp$date <- paste(temp$date, tempx, sep=" ")
temp <- temp[complete.cases(temp),]

 

Let’s convert this variable to R’s date format:

as.Date(temp$date, format="%a %d %B %Y")
##  [1] "2014-08-17" "2014-08-25" "2014-08-31" "2014-09-13" "2014-09-20"
##  [6] "2014-09-27" "2014-10-04" "2014-10-19" "2014-10-25" "2014-11-01"
## [11] "2014-11-08" "2014-11-23" "2014-11-29" "2014-12-02" "2014-12-06"
## [16] "2014-12-14" "2014-12-21" "2014-12-26" "2014-12-29" "2015-01-01"
## [21] "2015-01-10" "2015-01-17" "2015-01-31" "2015-02-07" "2015-02-10"
## [26] "2015-02-22" "2015-03-01" "2015-03-04" "2015-03-14" "2015-03-21"
## [31] "2015-04-04" "2015-04-11" "2015-04-18" "2015-04-25" "2015-05-02"
## [36] "2015-05-09" "2015-05-16" "2015-05-24"
temp$date <- as.Date(temp$date, format="%a %d %B %Y")
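
One thing to watch: the %a and %B codes match weekday and month names in your session’s locale, so the conversion can return NAs on a non-English system. A hedged sketch of one workaround is below. Switching LC_TIME to "C" and stripping the weekday are my own suggestions rather than part of the original workflow, and the `raw` strings are just examples in the same format:

```r
# Use the C locale for dates so English month names are recognised.
# Note this changes the setting for the rest of the R session.
Sys.setlocale("LC_TIME", "C")

raw <- c("Sun 17 August 2014", "Sat 4 October 2014", "Thu 1 January 2015")

# Drop the leading weekday, then parse day, full month name and year
no_weekday <- sub("^[A-Za-z]+ ", "", raw)
parsed <- as.Date(no_weekday, format = "%d %B %Y")
parsed
```

The weekday carries no extra information once the day, month and year are known, so dropping it removes one locale-dependent piece from the format string.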

  And then, get rid of games yet to be played - I do this by using grepl to determine if the FT variable contains a ‘v’ or not and only keep those that do not:

grepl("v", as.character(temp$FT))
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
## [23]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [34]  TRUE  TRUE  TRUE  TRUE  TRUE
temp <- temp[grepl("v", as.character(temp$FT))==F, ]
temp
##          date                home  FT              visitor
## 2  2014-08-17           Liverpool 2-1          Southampton
## 3  2014-08-25     Manchester City 3-1            Liverpool
## 4  2014-08-31   Tottenham Hotspur 0-3            Liverpool
## 6  2014-09-13           Liverpool 0-1          Aston Villa
## 7  2014-09-20     West Ham United 3-1            Liverpool
## 8  2014-09-27           Liverpool 1-1              Everton
## 10 2014-10-04           Liverpool 2-1 West Bromwich Albion
## 11 2014-10-19 Queens Park Rangers 2-3            Liverpool
## 12 2014-10-25           Liverpool 0-0            Hull City
## 14 2014-11-01    Newcastle United 1-0            Liverpool
## 15 2014-11-08           Liverpool 1-2              Chelsea
## 16 2014-11-23      Crystal Palace 3-1            Liverpool
## 17 2014-11-29           Liverpool 1-0           Stoke City
## 19 2014-12-02      Leicester City 1-3            Liverpool
## 20 2014-12-06           Liverpool 0-0           Sunderland
## 21 2014-12-14   Manchester United 3-0            Liverpool
## 22 2014-12-21           Liverpool 2-2              Arsenal
## 23 2014-12-26             Burnley 0-1            Liverpool
## 24 2014-12-29           Liverpool 4-1         Swansea City
## 26 2015-01-01           Liverpool 2-2       Leicester City

Finally, add the ‘hgoal’ and ‘vgoal’ variables that we use to calculate cumulative goals and results. I do this using separate from tidyr - the remove=F makes sure that the original ‘FT’ variable is kept:

temp <- temp %>% separate(FT, c("hgoal", "vgoal"), sep="-", remove=F)
temp
##          date                home  FT hgoal vgoal              visitor
## 2  2014-08-17           Liverpool 2-1     2     1          Southampton
## 3  2014-08-25     Manchester City 3-1     3     1            Liverpool
## 4  2014-08-31   Tottenham Hotspur 0-3     0     3            Liverpool
## 6  2014-09-13           Liverpool 0-1     0     1          Aston Villa
## 7  2014-09-20     West Ham United 3-1     3     1            Liverpool
## 8  2014-09-27           Liverpool 1-1     1     1              Everton
## 10 2014-10-04           Liverpool 2-1     2     1 West Bromwich Albion
## 11 2014-10-19 Queens Park Rangers 2-3     2     3            Liverpool
## 12 2014-10-25           Liverpool 0-0     0     0            Hull City
## 14 2014-11-01    Newcastle United 1-0     1     0            Liverpool
## 15 2014-11-08           Liverpool 1-2     1     2              Chelsea
## 16 2014-11-23      Crystal Palace 3-1     3     1            Liverpool
## 17 2014-11-29           Liverpool 1-0     1     0           Stoke City
## 19 2014-12-02      Leicester City 1-3     1     3            Liverpool
## 20 2014-12-06           Liverpool 0-0     0     0           Sunderland
## 21 2014-12-14   Manchester United 3-0     3     0            Liverpool
## 22 2014-12-21           Liverpool 2-2     2     2              Arsenal
## 23 2014-12-26             Burnley 0-1     0     1            Liverpool
## 24 2014-12-29           Liverpool 4-1     4     1         Swansea City
## 26 2015-01-01           Liverpool 2-2     2     2       Leicester City
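
Note that separate leaves the new columns as character by default, which is why getdatafun1 wraps them in as.numeric(as.character()). As an optional variation, here is a small sketch using the convert argument of separate on invented scorelines (the `toy` data frame is hypothetical):

```r
library(tidyr)

# Invented scorelines in the same "home-away" format as the FT column
toy <- data.frame(FT = c("2-1", "0-3", "1-1"), stringsAsFactors = FALSE)

# convert = TRUE asks separate() to turn the new columns into integers
toy_split <- separate(toy, FT, c("hgoal", "vgoal"), sep = "-",
                      remove = FALSE, convert = TRUE)
str(toy_split)
```

Either approach works with getdatafun1, since as.numeric(as.character()) is harmless on columns that are already numeric.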

 

Get cumulative data for 2014/15 season

temp2014 <- getdatafun1(temp, "Liverpool")
temp2014$id <- 2014
head(temp2014)
##   Gameno Cumpts CumGF CumGA CumGD   id
## 1      1      3     2     1     1 2014
## 2      2      3     3     4    -1 2014
## 3      3      6     6     4     2 2014
## 4      4      6     6     5     1 2014
## 5      5      6     7     8    -1 2014
## 6      6      7     8     9    -1 2014

  Bind the data from previous seasons and 2014/15 together:

mydf<-rbind(alldata,temp2014)

 

Visualization set-up


We are going to visualize the ‘mydf’ dataframe. First a couple of little extra bits. Add a row for every season in the dataframe with each variable of interest being equal to 0 - this is so all the lines start from 0 and not at 1.

tmp <- data.frame(Gameno=0,Cumpts=0, CumGF=0, CumGA=0, CumGD=0, id=unique(mydf$id))
mydf<-rbind(mydf,tmp)
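
The data.frame call above relies on recycling: the single zeros are repeated to match the number of unique seasons. A small sketch with an invented two-season table (`toy` and `zero_rows` are hypothetical names) makes this visible:

```r
# Invented cumulative table for two seasons
toy <- data.frame(Gameno = c(1, 2, 1, 2), Cumpts = c(3, 4, 0, 3),
                  id = c("2013", "2013", "2014", "2014"),
                  stringsAsFactors = FALSE)

# The single 0s are recycled, giving one zero row per unique season
zero_rows <- data.frame(Gameno = 0, Cumpts = 0, id = unique(toy$id),
                        stringsAsFactors = FALSE)
toy <- rbind(toy, zero_rows)

# Sorting is only for display here
toy[order(toy$id, toy$Gameno), ]
```

The zero rows end up at the bottom of the data frame, but that does not matter for plotting, because geom_line connects the points of each group in order of the x variable.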


Now add a variable containing two groups. A ‘0’ indicates all the lines/seasons that we’ll make gray, the ‘1’ indicates the lines/seasons that we’ll make a different color.

mydf$grp <- ifelse(mydf$id <2014, 0, 1)
head(mydf)
##        Gameno Cumpts CumGF CumGA CumGD   id grp
## 1992.1      1      0     0     1    -1 1992   0
## 1992.2      2      3     2     2     0 1992   0
## 1992.3      3      3     2     4    -2 1992   0
## 1992.4      4      4     4     6    -2 1992   0
## 1992.5      5      5     6     8    -2 1992   0
## 1992.6      6      6     7     9    -2 1992   0

 

Lastly, I’m subsetting a dataframe only for the current season - this will help me make this line thicker:

x2014 <- mydf %>% filter(id==2014)

 

Visualization !!!

ggplot(mydf, aes(Gameno, CumGF, group=id, color=grp)) +
  geom_line(aes(group=id, color=factor(grp))) +
  geom_line(data=x2014, aes(Gameno, CumGF, group=id, color=factor(grp)), lwd=1.5) +
  xlab("Game number") + ylab("Cumulative goals") +
  scale_color_manual(values=c("gray80", "red")) +
  ggtitle("Liverpool - Premier League goals by game") +
  theme(
    plot.title = element_text(hjust=0,vjust=1, size=rel(1.7)),
    panel.background = element_blank(),
    panel.grid.major.y = element_line(color="gray65"),
    panel.grid.major.x = element_line(color="gray65"),
    panel.grid.minor = element_blank(),
    plot.background  = element_blank(),
    text = element_text(color="gray20", size=10),
    axis.text = element_text(size=rel(1.0)),
    axis.text.x = element_text(color="gray20",size=rel(1.5)),
    axis.text.y = element_text(color="gray20", size=rel(1.5)),
    axis.title.x = element_text(size=rel(1.5), vjust=0),
    axis.title.y = element_text(size=rel(1.5), vjust=1),
    axis.ticks.y = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "none"
  )

 

You can annotate ggplots using geom_text but, in my experience, it can be quite fiddly. For a graph like this, I prefer manual editing in an illustration or paint-type program. For instance, to identify some of the outlying lines/seasons, we can do this:

mydf %>% filter(Gameno==20) %>% arrange(desc(CumGF))
##    Gameno Cumpts CumGF CumGA CumGD   id grp
## 1      20     39    46    23    23 2013   0
## 2      20     39    37    19    18 1996   0
## 3      20     33    37    25    12 2000   0
## 4      20     33    37    25    12 2009   0
## 5      20     36    36    19    17 1994   0
## 6      20     35    36    18    18 1995   0
## 7      20     37    36    19    17 1997   0
## 8      20     31    36    25    11 1998   0
## 9      20     45    35    13    22 2008   0
## 10     20     28    34    32     2 1992   0
## 11     20     34    34    20    14 2004   0
## 12     20     38    34    13    21 2007   0
## 13     20     31    33    26     7 1993   0
## 14     20     37    31    17    14 1999   0
## 15     20     28    31    26     5 2012   0
## 16     20     38    30    20    10 2001   0
## 17     20     32    30    21     9 2003   0
## 18     20     33    29    20     9 2002   0
## 19     20     44    29    11    18 2005   0
## 20     20     34    28    16    12 2006   0
## 21     20     29    28    27     1 2014   1
## 22     20     25    24    27    -3 2010   0
## 23     20     34    24    18     6 2011   0
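
If you would rather stay in R, one possible route is to label the end of each line with geom_text. The sketch below uses invented data and is only a starting point; the `toy` and `ends` data frames, the nudging values and the label size are all my own choices and would need tuning on the real chart:

```r
library(ggplot2)

# Invented cumulative goals for two seasons
toy <- data.frame(
  Gameno = rep(0:3, times = 2),
  CumGF  = c(0, 2, 3, 6, 0, 1, 1, 2),
  id     = rep(c("2013", "2014"), each = 4)
)

# One row per season: the last game played, used to position the label
ends <- do.call(rbind, lapply(split(toy, toy$id),
                              function(d) d[which.max(d$Gameno), ]))

p <- ggplot(toy, aes(Gameno, CumGF, group = id)) +
  geom_line(color = "gray60") +
  geom_text(data = ends, aes(label = id), hjust = -0.2, size = 3) +
  expand_limits(x = max(toy$Gameno) + 0.5)  # leave room for the labels
p
```

With 20 or more seasons the labels will overlap, so in practice you would probably pass only the handful of outlying seasons to geom_text.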

 

Adding this information to the chart gives this:


Example 2:

I shall put all this code into one chunk - hopefully by reading the above you can see how the chart is made. Let’s look at the cumulative points of Manchester United in the EPL by manager:

#1992/93 - 2013/14 data

df1 <- engsoccerdata2
df1$date <- as.Date(df1$Date, format="%Y-%m-%d")

df1 <- df1 %>% filter(home=="Manchester United" | visitor=="Manchester United")  %>% filter(tier==1 & Season>=1992)

myseasons1 <- split(df1,df1$Season)

x1 <- lapply(myseasons1, getdatafun1, teamname="Manchester United")
alldata1 <- do.call("rbind", x1)
alldata1$id <- rep(names(x1), sapply(x1, nrow))
#2014/15 data
doc1 <- readHTMLTable("http://www.espn.co.uk/football/sport/match/index.html?event=3;team=311;type=results")

temp1 <- doc1[[2]][c(1,5:7)]
colnames(temp1)<-c("date", "home", "FT", "visitor")

tempx1 <- ifelse(is.na(temp1$home), as.character(temp1$date), NA)  #as.character() because 'date' is a factor

tempx1 <- zoo::na.locf(tempx1)

temp1$date <- paste(temp1$date, tempx1, sep=" ")
temp1 <- temp1[complete.cases(temp1),]
temp1$date <- as.Date(temp1$date, format="%a %d %B %Y")

temp1 <- temp1[grepl("v", as.character(temp1$FT))==F, ]

temp1 <- temp1 %>% separate(FT, c("hgoal", "vgoal"), sep="-", remove=F)

temp2014a <- getdatafun1(temp1, "Manchester United")
temp2014a$id <- 2014
# combine data
mydf1<-rbind(alldata1,temp2014a)

tmp1 <- data.frame(Gameno=0,Cumpts=0, CumGF=0, CumGA=0, CumGD=0, id=unique(mydf1$id))
mydf1<-rbind(mydf1,tmp1)

mydf1$grp <- ifelse(mydf1$id <2013, 0, 
                    ifelse(mydf1$id ==2013, 1,
                    2)
)
mu2014 <- mydf1 %>% filter(id==2014)
mu2013 <- mydf1 %>% filter(id==2013)
## Visualize

ggplot(mydf1, aes(Gameno, Cumpts, group=id, color=grp)) +
  geom_line(aes(group=id, color=factor(grp))) +
  geom_line(data=mu2013, aes(Gameno, Cumpts, group=id, color=factor(grp)), lwd=1.1) +  
  geom_line(data=mu2014, aes(Gameno, Cumpts, group=id, color=factor(grp)), lwd=1.1) +
  xlab("Game number") + ylab("Cumulative points") +
  scale_color_manual(values=c("gray80", "red4", "red")) +
  ggtitle("Man Utd - Premier League points by game") +
  theme(
    plot.title = element_text(hjust=0,vjust=1, size=rel(1.7)),
    panel.background = element_blank(),
    panel.grid.major.y = element_line(color="gray65"),
    panel.grid.major.x = element_line(color="gray65"),
    panel.grid.minor = element_blank(),
    plot.background  = element_blank(),
    text = element_text(color="gray20", size=10),
    axis.text = element_text(size=rel(1.0)),
    axis.text.x = element_text(color="gray20",size=rel(1.5)),
    axis.text.y = element_text(color="gray20", size=rel(1.5)),
    axis.title.x = element_text(size=rel(1.5), vjust=0),
    axis.title.y = element_text(size=rel(1.5), vjust=1),
    axis.ticks.y = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "none"
  )

 

… and with some extra edits…