In this guide I’ll show you how to create simple charts that track the cumulative goals scored, goals conceded and points gained over the course of different seasons. We can do this very easily using my R package engsoccerdata, which contains the date and result of every English league game played up to the end of the 2013/14 season.
First install my engsoccerdata package from GitHub if you haven’t already. This needs the devtools package to be installed and loaded:
library(devtools)
# The "user/repo" string is all install_github needs; a separate username argument is redundant
install_github("jalapic/engsoccerdata")
Now load the required packages. In addition to engsoccerdata, we are using dplyr to flexibly restructure our data and ggplot2 to visualize it. We shall also use XML to do some web-scraping for the 2014/15 data, and tidyr for its separate function.
library(engsoccerdata)
library(dplyr)
library(ggplot2)
library(XML)
library(tidyr)
The dataset to use is engsoccerdata2 - this contains all league results up to the end of the 2013/14 season. Each season in this dataset is referred to by the year in which it began, so 2013 refers to the 2013/14 season. This is what it looks like after adding an extra variable in the appropriate date format:
df <- engsoccerdata2
df$date <- as.Date(df$Date, format="%Y-%m-%d")
tail(df)
## Date Season home visitor FT hgoal vgoal
## 188055 2013-09-28 2013 York City Portsmouth 4-2 4 2
## 188056 2013-11-30 2013 York City Rochdale 0-0 0 0
## 188057 2013-10-29 2013 York City Scunthorpe United 4-1 4 1
## 188058 2014-02-22 2013 York City Southend United 0-0 0 0
## 188059 2014-03-25 2013 York City Torquay United 1-0 1 0
## 188060 2014-03-15 2013 York City Wycombe Wanderers 2-0 2 0
## division tier totgoal goaldif result date
## 188055 4 4 6 2 H 2013-09-28
## 188056 4 4 0 0 D 2013-11-30
## 188057 4 4 5 3 H 2013-10-29
## 188058 4 4 0 0 D 2014-02-22
## 188059 4 4 1 1 H 2014-03-25
## 188060 4 4 2 2 H 2014-03-15
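If you want to show seasons in the more familiar "2013/14" form on a chart or in a table, a small helper can convert the start year. This is only a sketch: season_label is a hypothetical name of my own and is not part of engsoccerdata.

```r
# Hypothetical helper (not in engsoccerdata): turn the year a season began in
# into a "YYYY/YY" label, e.g. 2013 becomes "2013/14".
season_label <- function(start_year) {
  start_year <- as.integer(start_year)
  # %02d keeps the leading zero for seasons that end in a new century year
  sprintf("%d/%02d", start_year, (start_year + 1L) %% 100L)
}

season_label(c(1992, 1999, 2013))
```

The modulo step is there so that a season starting in 1999 is labelled with "00" rather than "100".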
The first thing to do is filter the data to only include the team we’re interested in looking at. The following code shows the required spelling of all teams in the dataset alphabetically:
sort(unique(engsoccerdata2$home))
## [1] "Aberdare Athletic" "Accrington"
## [3] "Accrington F.C." "Accrington Stanley"
## [5] "AFC Bournemouth" "AFC Wimbledon"
## [7] "Aldershot" "Arsenal"
## [9] "Ashington" "Aston Villa"
## [11] "Barnet" "Barnsley"
## [13] "Barrow" "Birmingham City"
## [15] "Blackburn Rovers" "Blackpool"
## [17] "Bolton Wanderers" "Bootle"
## [19] "Boston United" "Bradford City"
## [21] "Bradford Park Avenue" "Brentford"
## [23] "Brighton & Hove Albion" "Bristol City"
## [25] "Bristol Rovers" "Burnley"
## [27] "Burton Albion" "Burton Swifts"
## [29] "Burton United" "Burton Wanderers"
## [31] "Bury" "Cambridge United"
## [33] "Cardiff City" "Carlisle United"
## [35] "Charlton Athletic" "Chelsea"
## [37] "Cheltenham" "Chester"
## [39] "Chesterfield" "Colchester United"
## [41] "Coventry City" "Crawley Town"
## [43] "Crewe Alexandra" "Crystal Palace"
## [45] "Dagenham and Redbridge" "Darlington"
## [47] "Darwen" "Derby County"
## [49] "Doncaster Rovers" "Durham City"
## [51] "Everton" "Exeter City"
## [53] "Fleetwood Town" "Fulham"
## [55] "Gainsborough Trinity" "Gateshead"
## [57] "Gillingham" "Glossop North End"
## [59] "Grimsby Town" "Halifax Town"
## [61] "Hartlepool United" "Hereford United"
## [63] "Huddersfield Town" "Hull City"
## [65] "Ipswich Town" "Kidderminster Harriers"
## [67] "Leeds City" "Leeds United"
## [69] "Leicester City" "Leyton Orient"
## [71] "Lincoln City" "Liverpool"
## [73] "Loughborough" "Luton Town"
## [75] "Macclesfield" "Maidstone United"
## [77] "Manchester City" "Manchester United"
## [79] "Mansfield Town" "Merthyr Town"
## [81] "Middlesbrough" "Middlesbrough Ironopolis"
## [83] "Millwall" "Milton Keynes Dons"
## [85] "Morecambe" "Nelson"
## [87] "New Brighton" "New Brighton Tower"
## [89] "Newcastle United" "Newport County"
## [91] "Northampton Town" "Northwich Victoria"
## [93] "Norwich City" "Nottingham Forest"
## [95] "Notts County" "Oldham Athletic"
## [97] "Oxford United" "Peterborough United"
## [99] "Plymouth Argyle" "Port Vale"
## [101] "Portsmouth" "Preston North End"
## [103] "Queens Park Rangers" "Reading"
## [105] "Rochdale" "Rotherham County"
## [107] "Rotherham Town" "Rotherham United"
## [109] "Rushden & Diamonds" "Scarborough"
## [111] "Scunthorpe United" "Sheffield United"
## [113] "Sheffield Wednesday" "Shrewsbury Town"
## [115] "South Shields" "Southampton"
## [117] "Southend United" "Southport"
## [119] "Stalybridge Celtic" "Stevenage Borough"
## [121] "Stockport County" "Stoke City"
## [123] "Sunderland" "Swansea City"
## [125] "Swindon Town" "Thames"
## [127] "Torquay United" "Tottenham Hotspur"
## [129] "Tranmere Rovers" "Walsall"
## [131] "Watford" "West Bromwich Albion"
## [133] "West Ham United" "Wigan Athletic"
## [135] "Wigan Borough" "Wimbledon"
## [137] "Wolverhampton Wanderers" "Workington"
## [139] "Wrexham" "Wycombe Wanderers"
## [141] "Yeovil" "York City"
Let’s say we want to know how many goals Liverpool have scored cumulatively in Premier League games within each season (the first EPL season was 1992/93):
df <- df %>% filter(home=="Liverpool" | visitor=="Liverpool") %>% filter(tier==1 & Season>=1992)
Next, split every season into its own dataframe - this will be stored as a list.
# Split by Season
myseasons <- split(df,df$Season)
These are the first few rows of the first season in our filtered data. Note that the data are not initially sorted by date but in alphabetical order of the home team:
head(myseasons[[1]])
## Date Season home visitor FT hgoal vgoal division
## 1 1993-01-31 1992 Arsenal Liverpool 0-1 0 1 1
## 2 1992-09-19 1992 Aston Villa Liverpool 4-2 4 2 1
## 3 1993-04-03 1992 Blackburn Rovers Liverpool 4-1 4 1 1
## 4 1993-02-10 1992 Chelsea Liverpool 0-0 0 0 1
## 5 1992-12-19 1992 Coventry City Liverpool 5-1 5 1 1
## 6 1993-03-23 1992 Crystal Palace Liverpool 1-1 1 1 1
## tier totgoal goaldif result date
## 1 1 1 -1 A 1993-01-31
## 2 1 6 2 H 1992-09-19
## 3 1 5 3 H 1993-04-03
## 4 1 0 0 D 1993-02-10
## 5 1 6 4 H 1992-12-19
## 6 1 2 0 D 1993-03-23
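To see what split is doing, here is a minimal sketch on a made-up three-row dataframe (the rows are invented for illustration and are not real results). Each distinct value of the splitting variable becomes one named element of the list.

```r
# Invented toy data: two games in one season, one in the next
toy <- data.frame(
  Season  = c(1992, 1992, 1993),
  home    = c("Liverpool", "Arsenal", "Liverpool"),
  visitor = c("Arsenal", "Liverpool", "Everton"),
  stringsAsFactors = FALSE
)

# One dataframe per season, stored in a named list
by_season <- split(toy, toy$Season)

names(by_season)         # the element names are the season values, as character
sapply(by_season, nrow)  # number of games in each season
```

Note that the list names are character strings even though Season is numeric. That matters later, when those names are reused as a season id.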
What we need to do now is write a function that we can apply to every dataframe in the list myseasons. This function will turn the raw data into cumulative goals scored, goals conceded, goal difference and points. We could also look at cumulative home wins, defeats, away draws etc. but I won’t do that here.
In brief, this function uses dplyr to build one dataframe from the home team's point of view and one from the away team's, each with the following variables: team name, opponent, goals for, goals against, goal difference, result (win, draw, loss) and venue (home or away). These two dataframes are then combined into a ‘dfboth’ dataframe, which is filtered down to the team of interest. We can then number the games within a season by date using dense_rank from dplyr and calculate all the cumulative variables using the base-R cumsum function.
getdatafun1 <- function(df, teamname){
dfhome <- df %>% mutate(GF=as.numeric(as.character(hgoal)),
GA=as.numeric(as.character(vgoal)),
GD = GF-GA,
result=ifelse(GD>0, "W", ifelse(GD<0, "L", "D")),
venue="home") %>%
select(date, team=home, opponent=visitor, GF,GA,GD,result,venue)
dfaway <- df %>% mutate(GF=as.numeric(as.character(vgoal)),
GA=as.numeric(as.character(hgoal)),
GD = GF-GA,
result=ifelse(GD>0, "W", ifelse(GD<0, "L", "D")),
venue="away") %>%
select(date, team=visitor, opponent=home, GF,GA,GD,result,venue)
dfboth<-rbind(dfhome,dfaway)
dfboth <- dfboth %>%
filter(team==teamname) %>%
mutate(Gameno = dense_rank(date)) %>%
arrange(Gameno) %>%
mutate(Pts = ifelse(result=="W", 3, ifelse(result=="D", 1, 0))) %>%
mutate(Cumpts = cumsum(Pts),
CumGF = cumsum(GF),
CumGA = cumsum(GA),
CumGD = cumsum(GD)) %>%
select(Gameno, Cumpts, CumGF, CumGA, CumGD)
return(dfboth)
}
As an example, here is the function applied to Liverpool’s 1992/93 season (the first EPL season). Note that we are required to name the team we are interested in again - the function uses this 2nd argument to recognize what to filter from the inputted dataframe:
getdatafun1(myseasons[[1]], "Liverpool")
## Gameno Cumpts CumGF CumGA CumGD
## 1 1 0 0 1 -1
## 2 2 3 2 2 0
## 3 3 3 2 4 -2
## 4 4 4 4 6 -2
## 5 5 5 6 8 -2
## 6 6 6 7 9 -2
## 7 7 9 9 10 -1
## 8 8 9 9 11 -2
## 9 9 9 11 15 -4
## 10 10 9 13 18 -5
## 11 11 12 14 18 -4
## 12 12 13 16 20 -4
## 13 13 16 20 21 -1
## 14 14 16 20 23 -3
## 15 15 19 24 24 0
## 16 16 22 25 24 1
## 17 17 25 30 24 6
## 18 18 25 31 26 5
## 19 19 28 33 27 6
## 20 20 28 34 32 2
## 21 21 29 35 33 2
## 22 22 29 36 35 1
## 23 23 29 36 37 -1
## 24 24 32 37 37 0
## 25 25 33 37 37 0
## 26 26 34 37 37 0
## 27 27 34 38 39 -1
## 28 28 35 38 39 -1
## 29 29 36 39 40 -1
## 30 30 36 40 42 -2
## 31 31 39 41 42 -1
## 32 32 42 43 43 0
## 33 33 45 44 43 1
## 34 34 46 45 44 1
## 35 35 46 46 48 -2
## 36 36 49 47 48 -1
## 37 37 50 48 49 -1
## 38 38 53 52 49 3
## 39 39 56 54 49 5
## 40 40 56 54 50 4
## 41 41 56 56 53 3
## 42 42 59 62 55 7
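The core of the function is just two nested ifelse calls followed by cumsum. Here is that logic on its own in base R, using four invented scorelines (not real results) that are assumed to be in date order already:

```r
# Invented goals for / against in four consecutive games
GF <- c(2, 1, 0, 3)
GA <- c(1, 1, 2, 0)
GD <- GF - GA

# Same result and points rules as in getdatafun1: 3 for a win, 1 for a draw
result <- ifelse(GD > 0, "W", ifelse(GD < 0, "L", "D"))
Pts    <- ifelse(result == "W", 3, ifelse(result == "D", 1, 0))

toy_cum <- data.frame(
  Gameno = seq_along(GF),
  Cumpts = cumsum(Pts),
  CumGF  = cumsum(GF),
  CumGA  = cumsum(GA),
  CumGD  = cumsum(GD)
)
toy_cum
```

Because cumsum simply accumulates in row order, the rows must be sorted by date first, which is why the real function arranges by Gameno before summing.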
OK - apply this function to all dataframes in myseasons using lapply, then convert to one long dataframe and add in a new variable (id) that refers to the season the data comes from:
x <- lapply(myseasons, getdatafun1, teamname="Liverpool")
alldata <- do.call("rbind", x)
alldata$id <- rep(names(x), sapply(x, nrow))
head(alldata, 10)
## Gameno Cumpts CumGF CumGA CumGD id
## 1992.1 1 0 0 1 -1 1992
## 1992.2 2 3 2 2 0 1992
## 1992.3 3 3 2 4 -2 1992
## 1992.4 4 4 4 6 -2 1992
## 1992.5 5 5 6 8 -2 1992
## 1992.6 6 6 7 9 -2 1992
## 1992.7 7 9 9 10 -1 1992
## 1992.8 8 9 9 11 -2 1992
## 1992.9 9 9 11 15 -4 1992
## 1992.10 10 9 13 18 -5 1992
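The rep(names(x), sapply(x, nrow)) idiom repeats each list name once per row of its dataframe. As a sketch on a made-up list, here it is next to dplyr's bind_rows with its .id argument, which is an alternative I am suggesting rather than what the code above uses:

```r
library(dplyr)

# Invented list standing in for the output of lapply(myseasons, ...)
x_toy <- list(
  `1992` = data.frame(Gameno = 1:2, Cumpts = c(3, 4)),
  `1993` = data.frame(Gameno = 1:3, Cumpts = c(0, 1, 4))
)

# Base R, as in the text: stack, then repeat each name once per row
base_way <- do.call("rbind", x_toy)
base_way$id <- rep(names(x_toy), sapply(x_toy, nrow))

# dplyr alternative: .id stores the list names in a new column
dplyr_way <- bind_rows(x_toy, .id = "id")

base_way$id
dplyr_way$id
```

Both give the same character id column. The base version keeps row names such as 1992.1, while bind_rows puts id first and uses plain row numbers.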
My engsoccerdata package contains data up to the end of the 2013/14 season. I’ve decided to update it at the end of each season, so we will have to get the 2014/15 data from elsewhere. A straightforward way would be to manually enter the data - though I am generally anti doing anything manual when it comes to data entry.
A simpler approach is to scrape the data from ESPNFC. This webpage contains a list of all of this season’s Liverpool EPL games. Just change the filter options and use the appropriately adjusted web link if looking at another team/division.
I am using the readHTMLTable function from the XML library. The results of this are stored as a list of 3 elements. The 2nd element contains the data. Looking at the variable names and cross-referencing to the web page, we only need to keep variables 1 (date), 5 (Home), 6 (not named but contains the FT scoreline) and 7 (Away):
doc <- readHTMLTable("http://www.espn.co.uk/football/sport/match/index.html?event=3;team=305;type=results")
colnames(doc[[2]])
## [1] "Date" "BST/GMT" "TV" "" "Home" "" "Away"
## [8] "HT" "Attend." ""
I rename the variables - note this also contains games yet to be played:
temp <- doc[[2]][c(1,5:7)]
colnames(temp)<-c("date", "home", "FT", "visitor")
temp
## date home FT visitor
## 1 August 2014 <NA> <NA> <NA>
## 2 Sun 17 Liverpool 2-1 Southampton
## 3 Mon 25 Manchester City 3-1 Liverpool
## 4 Sun 31 Tottenham Hotspur 0-3 Liverpool
## 5 September 2014 <NA> <NA> <NA>
## 6 Sat 13 Liverpool 0-1 Aston Villa
## 7 Sat 20 West Ham United 3-1 Liverpool
## 8 Sat 27 Liverpool 1-1 Everton
## 9 October 2014 <NA> <NA> <NA>
## 10 Sat 4 Liverpool 2-1 West Bromwich Albion
## 11 Sun 19 Queens Park Rangers 2-3 Liverpool
## 12 Sat 25 Liverpool 0-0 Hull City
## 13 November 2014 <NA> <NA> <NA>
## 14 Sat 1 Newcastle United 1-0 Liverpool
## 15 Sat 8 Liverpool 1-2 Chelsea
## 16 Sun 23 Crystal Palace 3-1 Liverpool
## 17 Sat 29 Liverpool 1-0 Stoke City
## 18 December 2014 <NA> <NA> <NA>
## 19 Tue 2 Leicester City 1-3 Liverpool
## 20 Sat 6 Liverpool 0-0 Sunderland
## 21 Sun 14 Manchester United 3-0 Liverpool
## 22 Sun 21 Liverpool 2-2 Arsenal
## 23 Fri 26 Burnley 0-1 Liverpool
## 24 Mon 29 Liverpool 4-1 Swansea City
## 25 January 2015 <NA> <NA> <NA>
## 26 Thu 1 Liverpool 2-2 Leicester City
## 27 Sat 10 Sunderland v Liverpool
## 28 Sat 17 Aston Villa v Liverpool
## 29 Sat 31 Liverpool v West Ham United
## 30 February 2015 <NA> <NA> <NA>
## 31 Sat 7 Everton v Liverpool
## 32 Tue 10 Liverpool v Tottenham Hotspur
## 33 Sun 22 Southampton v Liverpool
## 34 March 2015 <NA> <NA> <NA>
## 35 Sun 1 Liverpool v Manchester City
## 36 Wed 4 Liverpool v Burnley
## 37 Sat 14 Swansea City v Liverpool
## 38 Sat 21 Liverpool v Manchester United
## 39 April 2015 <NA> <NA> <NA>
## 40 Sat 4 Arsenal v Liverpool
## 41 Sat 11 Liverpool v Newcastle United
## 42 Sat 18 Hull City v Liverpool
## 43 Sat 25 West Bromwich Albion v Liverpool
## 44 May 2015 <NA> <NA> <NA>
## 45 Sat 2 Liverpool v Queens Park Rangers
## 46 Sat 9 Chelsea v Liverpool
## 47 Sat 16 Liverpool v Crystal Palace
## 48 Sun 24 Stoke City v Liverpool
You will notice that the date variable is not yet nice and neat. For our purposes here, we could safely just delete the rows where each new month is listed, as the results are already in date order. However, just for completeness’ sake, I shall create a proper date variable and sort by this.
My strategy here is to create an empty vector of the original date variable length and to add in the month/years to that empty vector at the position that they occur in. This essentially means only copying the date variable when there is a “NA” in any of the other three variables.
# Keep the month/year text only on the header rows (where 'home' is NA);
# as.character() because 'date' is a factor
tempx <- ifelse(is.na(temp$home), as.character(temp$date), NA)
tempx
## [1] "August 2014" NA NA NA
## [5] "September 2014" NA NA NA
## [9] "October 2014" NA NA NA
## [13] "November 2014" NA NA NA
## [17] NA "December 2014" NA NA
## [21] NA NA NA NA
## [25] "January 2015" NA NA NA
## [29] NA "February 2015" NA NA
## [33] NA "March 2015" NA NA
## [37] NA NA "April 2015" NA
## [41] NA NA NA "May 2015"
## [45] NA NA NA NA
Now I fill down the NAs in the new ‘tempx’ vector using na.locf from the zoo package. This carries each non-NA element forward through the NAs that follow it until it reaches the next non-NA element. It is in my top 5 favorite R functions! A quick cautionary note: loading zoo masks the base-R as.Date function, which we will need later, so I am not loading the zoo library but just calling the function directly from it using :: :
tempx <- zoo::na.locf(tempx)
tempx
## [1] "August 2014" "August 2014" "August 2014" "August 2014"
## [5] "September 2014" "September 2014" "September 2014" "September 2014"
## [9] "October 2014" "October 2014" "October 2014" "October 2014"
## [13] "November 2014" "November 2014" "November 2014" "November 2014"
## [17] "November 2014" "December 2014" "December 2014" "December 2014"
## [21] "December 2014" "December 2014" "December 2014" "December 2014"
## [25] "January 2015" "January 2015" "January 2015" "January 2015"
## [29] "January 2015" "February 2015" "February 2015" "February 2015"
## [33] "February 2015" "March 2015" "March 2015" "March 2015"
## [37] "March 2015" "March 2015" "April 2015" "April 2015"
## [41] "April 2015" "April 2015" "April 2015" "May 2015"
## [45] "May 2015" "May 2015" "May 2015" "May 2015"
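If you would rather not depend on zoo for this one step, the same fill-down can be sketched in base R. The function name fill_down is my own, and this version assumes the first element is not NA, which holds here because the table starts with a month header.

```r
# Hypothetical base-R stand-in for zoo::na.locf on a vector.
# Assumes the first element is not NA.
fill_down <- function(x) {
  seen <- !is.na(x)
  stopifnot(seen[1])
  # cumsum(seen) counts how many non-NA values have appeared so far,
  # which is exactly the index of the value to carry forward
  x[seen][cumsum(seen)]
}

months <- c("August 2014", NA, NA, "September 2014", NA)
fill_down(months)
```

The trick is that cumsum over the logical vector increases by one at every non-NA element, so every position ends up pointing at the most recent header.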
Now, we paste together the contents of the scraped ‘date’ variable with this new vector to get each game’s actual date:
paste(temp$date, tempx, sep=" ")
## [1] "August 2014 August 2014" "Sun 17 August 2014"
## [3] "Mon 25 August 2014" "Sun 31 August 2014"
## [5] "September 2014 September 2014" "Sat 13 September 2014"
## [7] "Sat 20 September 2014" "Sat 27 September 2014"
## [9] "October 2014 October 2014" "Sat 4 October 2014"
## [11] "Sun 19 October 2014" "Sat 25 October 2014"
## [13] "November 2014 November 2014" "Sat 1 November 2014"
## [15] "Sat 8 November 2014" "Sun 23 November 2014"
## [17] "Sat 29 November 2014" "December 2014 December 2014"
## [19] "Tue 2 December 2014" "Sat 6 December 2014"
## [21] "Sun 14 December 2014" "Sun 21 December 2014"
## [23] "Fri 26 December 2014" "Mon 29 December 2014"
## [25] "January 2015 January 2015" "Thu 1 January 2015"
## [27] "Sat 10 January 2015" "Sat 17 January 2015"
## [29] "Sat 31 January 2015" "February 2015 February 2015"
## [31] "Sat 7 February 2015" "Tue 10 February 2015"
## [33] "Sun 22 February 2015" "March 2015 March 2015"
## [35] "Sun 1 March 2015" "Wed 4 March 2015"
## [37] "Sat 14 March 2015" "Sat 21 March 2015"
## [39] "April 2015 April 2015" "Sat 4 April 2015"
## [41] "Sat 11 April 2015" "Sat 18 April 2015"
## [43] "Sat 25 April 2015" "May 2015 May 2015"
## [45] "Sat 2 May 2015" "Sat 9 May 2015"
## [47] "Sat 16 May 2015" "Sun 24 May 2015"
Next, we add this to the dataframe as a new variable and remove incomplete rows (i.e. those with NAs that originally only contained month/year information):
temp$date <- paste(temp$date, tempx, sep=" ")
temp <- temp[complete.cases(temp),]
Let’s convert this variable to R’s date format:
as.Date(temp$date, format="%a %d %B %Y")
## [1] "2014-08-17" "2014-08-25" "2014-08-31" "2014-09-13" "2014-09-20"
## [6] "2014-09-27" "2014-10-04" "2014-10-19" "2014-10-25" "2014-11-01"
## [11] "2014-11-08" "2014-11-23" "2014-11-29" "2014-12-02" "2014-12-06"
## [16] "2014-12-14" "2014-12-21" "2014-12-26" "2014-12-29" "2015-01-01"
## [21] "2015-01-10" "2015-01-17" "2015-01-31" "2015-02-07" "2015-02-10"
## [26] "2015-02-22" "2015-03-01" "2015-03-04" "2015-03-14" "2015-03-21"
## [31] "2015-04-04" "2015-04-11" "2015-04-18" "2015-04-25" "2015-05-02"
## [36] "2015-05-09" "2015-05-16" "2015-05-24"
temp$date <- as.Date(temp$date, format="%a %d %B %Y")
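One thing to watch: %a and %B are matched against the weekday and month names of your current locale, so the conversion can silently return NA on a machine that is not set to English. A slightly more defensive sketch is to strip the weekday first, since it adds nothing, and then check for NAs. The two strings below are copied from the pasted dates above.

```r
raw <- c("Sun 17 August 2014", "Sat 4 October 2014")

# Drop the leading weekday so that only the month name depends on the locale
no_weekday <- sub("^[A-Za-z]+ +", "", raw)
no_weekday

parsed <- as.Date(no_weekday, format = "%d %B %Y")
parsed

# Any NA here would flag a string that did not match the format
anyNA(parsed)
```

If the month names still fail to parse, temporarily switching the time locale with Sys.setlocale("LC_TIME", "C") is one option to try, though I have not needed it here.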
And then, get rid of games yet to be played - I do this by using grepl to determine if the FT variable contains a ‘v’ or not and only keep those that do not:
grepl("v", as.character(temp$FT))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [34] TRUE TRUE TRUE TRUE TRUE
# Keep only the rows whose FT value does not contain a "v" (i.e. games already played)
temp <- temp[!grepl("v", as.character(temp$FT)), ]
temp
## date home FT visitor
## 2 2014-08-17 Liverpool 2-1 Southampton
## 3 2014-08-25 Manchester City 3-1 Liverpool
## 4 2014-08-31 Tottenham Hotspur 0-3 Liverpool
## 6 2014-09-13 Liverpool 0-1 Aston Villa
## 7 2014-09-20 West Ham United 3-1 Liverpool
## 8 2014-09-27 Liverpool 1-1 Everton
## 10 2014-10-04 Liverpool 2-1 West Bromwich Albion
## 11 2014-10-19 Queens Park Rangers 2-3 Liverpool
## 12 2014-10-25 Liverpool 0-0 Hull City
## 14 2014-11-01 Newcastle United 1-0 Liverpool
## 15 2014-11-08 Liverpool 1-2 Chelsea
## 16 2014-11-23 Crystal Palace 3-1 Liverpool
## 17 2014-11-29 Liverpool 1-0 Stoke City
## 19 2014-12-02 Leicester City 1-3 Liverpool
## 20 2014-12-06 Liverpool 0-0 Sunderland
## 21 2014-12-14 Manchester United 3-0 Liverpool
## 22 2014-12-21 Liverpool 2-2 Arsenal
## 23 2014-12-26 Burnley 0-1 Liverpool
## 24 2014-12-29 Liverpool 4-1 Swansea City
## 26 2015-01-01 Liverpool 2-2 Leicester City
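An alternative to dropping rows that contain a ‘v’ is to keep only rows that look like a scoreline. This is a sketch on invented FT values; the pattern assumes scores are always written as digits, a hyphen, then digits, as they are in the table above.

```r
# Invented FT values: three played games and one fixture still to come
FT <- c("2-1", "v", "0-0", "10-0")

# TRUE only where the whole string is digits-hyphen-digits
played <- grepl("^[0-9]+-[0-9]+$", FT)

FT[played]
```

The positive test is a little stricter: anything unexpected in the FT column gets dropped rather than passed on to the goal calculations.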
Finally, add the ‘hgoal’ and ‘vgoal’ variables that we use to calculate cumulative goals and results. I do this using separate from tidyr - the remove=F makes sure that the original ‘FT’ variable is kept:
temp <- temp %>% separate(FT, c("hgoal", "vgoal"), sep="-", remove=F)
temp
## date home FT hgoal vgoal visitor
## 2 2014-08-17 Liverpool 2-1 2 1 Southampton
## 3 2014-08-25 Manchester City 3-1 3 1 Liverpool
## 4 2014-08-31 Tottenham Hotspur 0-3 0 3 Liverpool
## 6 2014-09-13 Liverpool 0-1 0 1 Aston Villa
## 7 2014-09-20 West Ham United 3-1 3 1 Liverpool
## 8 2014-09-27 Liverpool 1-1 1 1 Everton
## 10 2014-10-04 Liverpool 2-1 2 1 West Bromwich Albion
## 11 2014-10-19 Queens Park Rangers 2-3 2 3 Liverpool
## 12 2014-10-25 Liverpool 0-0 0 0 Hull City
## 14 2014-11-01 Newcastle United 1-0 1 0 Liverpool
## 15 2014-11-08 Liverpool 1-2 1 2 Chelsea
## 16 2014-11-23 Crystal Palace 3-1 3 1 Liverpool
## 17 2014-11-29 Liverpool 1-0 1 0 Stoke City
## 19 2014-12-02 Leicester City 1-3 1 3 Liverpool
## 20 2014-12-06 Liverpool 0-0 0 0 Sunderland
## 21 2014-12-14 Manchester United 3-0 3 0 Liverpool
## 22 2014-12-21 Liverpool 2-2 2 2 Arsenal
## 23 2014-12-26 Burnley 0-1 0 1 Liverpool
## 24 2014-12-29 Liverpool 4-1 4 1 Swansea City
## 26 2015-01-01 Liverpool 2-2 2 2 Leicester City
temp2014 <- getdatafun1(temp, "Liverpool")
temp2014$id <- 2014
head(temp2014)
## Gameno Cumpts CumGF CumGA CumGD id
## 1 1 3 2 1 1 2014
## 2 2 3 3 4 -1 2014
## 3 3 6 6 4 2 2014
## 4 4 6 6 5 1 2014
## 5 5 6 7 8 -1 2014
## 6 6 7 8 9 -1 2014
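A small aside on the separate step: by default the new ‘hgoal’ and ‘vgoal’ columns are character, which is why getdatafun1 wraps them in as.numeric(as.character()). If you want numbers straight away, separate has a convert argument. A minimal sketch on two invented scorelines:

```r
library(tidyr)

# Invented scorelines, stored as character like the scraped FT column
scores <- data.frame(FT = c("2-1", "0-3"), stringsAsFactors = FALSE)

# convert = TRUE runs type conversion on the new columns, giving integers
scores <- separate(scores, FT, c("hgoal", "vgoal"),
                   sep = "-", remove = FALSE, convert = TRUE)

str(scores)
```

Either way works with the function above, since as.numeric(as.character()) is harmless on columns that are already numeric.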
Bind the data from previous seasons and 2014/15 together:
mydf<-rbind(alldata,temp2014)
We are going to visualize the ‘mydf’ dataframe. First a couple of little extra bits. Add a row for every season in the dataframe with each variable of interest being equal to 0 - this is so all the lines start from 0 and not at 1.
tmp <- data.frame(Gameno=0,Cumpts=0, CumGF=0, CumGA=0, CumGD=0, id=unique(mydf$id))
mydf<-rbind(mydf,tmp)
Now add a variable containing two groups. A ‘0’ indicates all the lines/seasons that we’ll make gray, the ‘1’ indicates the lines/seasons that we’ll make a different color.
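# id is stored as text (it came from the list names), so convert it before comparing
mydf$grp <- ifelse(as.numeric(as.character(mydf$id)) < 2014, 0, 1)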
head(mydf)
## Gameno Cumpts CumGF CumGA CumGD id grp
## 1992.1 1 0 0 1 -1 1992 0
## 1992.2 2 3 2 2 0 1992 0
## 1992.3 3 3 2 4 -2 1992 0
## 1992.4 4 4 4 6 -2 1992 0
## 1992.5 5 5 6 8 -2 1992 0
## 1992.6 6 6 7 9 -2 1992 0
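Because the ‘id’ variable starts life as the character names of the season list, it is worth knowing how R compares text with a number: the number is converted to text and the two are compared as strings. For four-digit years that happens to give the right answer, but it is fragile, as this sketch with invented values shows.

```r
id <- c("1992", "2013", "2014")

# String comparison: 2014 is coerced to "2014" first
id < 2014

# Numeric comparison: safer, and what we actually mean
as.numeric(id) < 2014

# Where the two disagree: "999" sorts after "2014" as text
"999" < 2014
as.numeric("999") < 2014
```

Converting with as.numeric (or as.numeric(as.character()) if the column might be a factor) removes the ambiguity.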
Lastly, I’m subsetting a dataframe only for the current season - this will help me make this line thicker:
x2014 <- mydf %>% filter(id==2014)
ggplot(mydf, aes(Gameno, CumGF, group=id, color=grp)) +
geom_line(aes(group=id, color=factor(grp))) +
geom_line(data=x2014, aes(Gameno, CumGF, group=id, color=factor(grp)), lwd=1.5) +
xlab("Game number") + ylab("Cumulative goals") +
scale_color_manual(values=c("gray80", "red")) +
ggtitle("Liverpool - Premier League goals by game") +
theme(
plot.title = element_text(hjust=0,vjust=1, size=rel(1.7)),
panel.background = element_blank(),
panel.grid.major.y = element_line(color="gray65"),
panel.grid.major.x = element_line(color="gray65"),
panel.grid.minor = element_blank(),
plot.background = element_blank(),
text = element_text(color="gray20", size=10),
axis.text = element_text(size=rel(1.0)),
axis.text.x = element_text(color="gray20",size=rel(1.5)),
axis.text.y = element_text(color="gray20", size=rel(1.5)),
axis.title.x = element_text(size=rel(1.5), vjust=0),
axis.title.y = element_text(size=rel(1.5), vjust=1),
axis.ticks.y = element_blank(),
axis.ticks.x = element_blank(),
legend.position = "none"
)
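Since the same theme settings get reused for every chart of this kind, one option is to store them in an object and add that to each plot. Below is a cut-down sketch on invented data; theme_cumulative is a hypothetical name of my own, and it keeps only a handful of the settings used above.

```r
library(ggplot2)

# Invented cumulative goals for three short "seasons", each starting from 0
toy <- data.frame(
  Gameno = rep(0:3, times = 3),
  CumGF  = c(0, 1, 3, 4,  0, 2, 2, 5,  0, 0, 1, 2),
  id     = rep(c("2012", "2013", "2014"), each = 4)
)
# 1 marks the season to highlight, 0 everything else
toy$grp <- factor(ifelse(toy$id == "2014", 1, 0))

# A theme is an ordinary object, so it can be saved and reused
theme_cumulative <- theme(
  panel.background = element_blank(),
  panel.grid.major = element_line(color = "gray65"),
  panel.grid.minor = element_blank(),
  axis.ticks       = element_blank(),
  legend.position  = "none"
)

p <- ggplot(toy, aes(Gameno, CumGF, group = id, color = grp)) +
  geom_line() +
  scale_color_manual(values = c("gray80", "red")) +
  theme_cumulative
p
```

Converting grp to a factor once, in the data, also means it does not need to be wrapped in factor() inside every layer.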
You can annotate ggplots using geom_text but, in my experience, it can be quite fiddly. For a graph like this, I prefer manual editing in an illustrator or paint type package. To decide which lines to label, for instance the outlying lines/seasons, we can rank the seasons at a given game number like this:
mydf %>% filter(Gameno==20) %>% arrange(desc(CumGF))
## Gameno Cumpts CumGF CumGA CumGD id grp
## 1 20 39 46 23 23 2013 0
## 2 20 39 37 19 18 1996 0
## 3 20 33 37 25 12 2000 0
## 4 20 33 37 25 12 2009 0
## 5 20 36 36 19 17 1994 0
## 6 20 35 36 18 18 1995 0
## 7 20 37 36 19 17 1997 0
## 8 20 31 36 25 11 1998 0
## 9 20 45 35 13 22 2008 0
## 10 20 28 34 32 2 1992 0
## 11 20 34 34 20 14 2004 0
## 12 20 38 34 13 21 2007 0
## 13 20 31 33 26 7 1993 0
## 14 20 37 31 17 14 1999 0
## 15 20 28 31 26 5 2012 0
## 16 20 38 30 20 10 2001 0
## 17 20 32 30 21 9 2003 0
## 18 20 33 29 20 9 2002 0
## 19 20 44 29 11 18 2005 0
## 20 20 34 28 16 12 2006 0
## 21 20 29 28 27 1 2014 1
## 22 20 25 24 27 -3 2010 0
## 23 20 34 24 18 6 2011 0
Adding this information to the chart gives this:
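If you do want to label the lines in code rather than by hand, one approach is to pull out the last point of each season and pass that to geom_text. This is only a sketch on invented data, and the nudging values will need adjusting for a real chart.

```r
library(ggplot2)

# Invented cumulative goals for two short "seasons"
toy <- data.frame(
  Gameno = rep(1:3, times = 2),
  CumGF  = c(1, 3, 4,  2, 2, 5),
  id     = rep(c("2013", "2014"), each = 3)
)

# The last game of each season: one row per id
last_pts <- do.call(rbind, lapply(split(toy, toy$id), function(d) {
  d[which.max(d$Gameno), ]
}))

p <- ggplot(toy, aes(Gameno, CumGF, group = id)) +
  geom_line(color = "gray60") +
  # Label each line at its final point, nudged to the right of the line end
  geom_text(data = last_pts, aes(label = id), hjust = -0.1, size = 3) +
  # Leave some room on the right so the labels are not clipped
  expand_limits(x = 3.5)
p
```

With 20-odd seasons the labels will overlap, which is a large part of why I find this fiddly and usually only label a few outlying seasons.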
I shall put all this code into one chunk - hopefully by reading the above you can see how the chart is made. Let’s look at the cumulative points of Manchester United in the EPL by manager:
#1992/93 - 2014/15 data
df1 <- engsoccerdata2
df1$date <- as.Date(df1$Date, format="%Y-%m-%d")
df1 <- df1 %>% filter(home=="Manchester United" | visitor=="Manchester United") %>% filter(tier==1 & Season>=1992) # tier==1 keeps top-flight games only, as in the Liverpool example
myseasons1 <- split(df1,df1$Season)
x1 <- lapply(myseasons1, getdatafun1, teamname="Manchester United")
alldata1 <- do.call("rbind", x1)
alldata1$id <- rep(names(x1), sapply(x1, nrow))
#2014/15 data
doc1 <- readHTMLTable("http://www.espn.co.uk/football/sport/match/index.html?event=3;team=311;type=results")
temp1 <- doc1[[2]][c(1,5:7)]
colnames(temp1)<-c("date", "home", "FT", "visitor")
tempx1 <- ifelse(is.na(temp1$home), as.character(temp1$date), NA) #as.character() because 'date' is a factor
tempx1 <- zoo::na.locf(tempx1)
temp1$date <- paste(temp1$date, tempx1, sep=" ")
temp1 <- temp1[complete.cases(temp1),]
temp1$date <- as.Date(temp1$date, format="%a %d %B %Y")
temp1 <- temp1[!grepl("v", as.character(temp1$FT)), ]
temp1 <- temp1 %>% separate(FT, c("hgoal", "vgoal"), sep="-", remove=F)
temp2014a <- getdatafun1(temp1, "Manchester United")
temp2014a$id <- 2014
# combine data
mydf1<-rbind(alldata1,temp2014a)
tmp1 <- data.frame(Gameno=0,Cumpts=0, CumGF=0, CumGA=0, CumGD=0, id=unique(mydf1$id))
mydf1<-rbind(mydf1,tmp1)
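# id is stored as text, so convert it once before comparing
# 0 = seasons before 2013/14, 1 = 2013/14, 2 = 2014/15
idnum <- as.numeric(as.character(mydf1$id))
mydf1$grp <- ifelse(idnum < 2013, 0,
ifelse(idnum == 2013, 1,
2)
)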
mu2014 <- mydf1 %>% filter(id==2014)
mu2013 <- mydf1 %>% filter(id==2013)
## Visualize
ggplot(mydf1, aes(Gameno, Cumpts, group=id, color=grp)) +
geom_line(aes(group=id, color=factor(grp))) +
geom_line(data=mu2013, aes(Gameno, Cumpts, group=id, color=factor(grp)), lwd=1.1) +
geom_line(data=mu2014, aes(Gameno, Cumpts, group=id, color=factor(grp)), lwd=1.1) +
xlab("Game number") + ylab("Cumulative points") +
scale_color_manual(values=c("gray80", "red4", "red")) +
ggtitle("Man Utd - Premier League points by game") +
theme(
plot.title = element_text(hjust=0,vjust=1, size=rel(1.7)),
panel.background = element_blank(),
panel.grid.major.y = element_line(color="gray65"),
panel.grid.major.x = element_line(color="gray65"),
panel.grid.minor = element_blank(),
plot.background = element_blank(),
text = element_text(color="gray20", size=10),
axis.text = element_text(size=rel(1.0)),
axis.text.x = element_text(color="gray20",size=rel(1.5)),
axis.text.y = element_text(color="gray20", size=rel(1.5)),
axis.title.x = element_text(size=rel(1.5), vjust=0),
axis.title.y = element_text(size=rel(1.5), vjust=1),
axis.ticks.y = element_blank(),
axis.ticks.x = element_blank(),
legend.position = "none"
)
… and with some extra edits…
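A note on the grouping variable in this last example: nested ifelse calls get hard to read once there are more than two or three groups. Base R's findInterval is one alternative for cutting seasons into consecutive eras. A sketch with invented season values, using the same break points as above:

```r
season <- c(1992, 2012, 2013, 2014)

# Counts how many break points each season has reached:
# 0 = before 2013, 1 = 2013, 2 = 2014 onwards
grp <- findInterval(season, c(2013, 2014))
grp
```

Adding another era is then just a matter of adding another break point and another color to scale_color_manual.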