Abstract

This is a natural language analysis of matching soccer team names, carried out as part of my research on Betting Strategy and Model Validation. I wrote these functions to make it easier to scrape and match team names for later calculations, and so reduce my future workload.

1. Natural Language

As we know, different bookmakers use different names for the same team, since they all operate independently. Quite a lot of hedging applications apply regular expressions to match data scraped from websites, and some use real-time API connections to hedge odds prices among bookmakers on the spot or to provide real-time information services to clients.

Today I am trying to scrape and match the team names with R.

2. Read and Process the Dataset

Read the dataset of worldwide soccer matches from 2011 to 2015, obtained from a British betting consultancy, here called firm A.

table 2.1 48744 x 20

Because the dataset is very large (48744 x 20), the webpage kept loading and could not be opened. Here I show only a few rows subset from the data frame.

Read the dataset of worldwide soccer matches from 2011 to 2015, scraped from the spbo livescore website.

table 2.2 5798529 x 20

Because the livescore dataset (5798529 x 20) contains a lot of unrelated soccer matches, here I show only a few rows of the data frame for viewing purposes.

3. Matching the team names

3.1 Matching Duplicated Team Names

To match a string, we can first apply match() or %in% to the team names. R treats strings that differ only in capital letters as different strings, so I apply tolower() before matching, since in real life such names are considered exactly the same team.

Table 3.1.1 : Exact matches and capital-letter differences.
| team | spbo | pass |
|---|---|---|
| Aachen | Aachen | Duplicated |
| Aalesund | Aalesund | Duplicated |
| Aarau | Aarau | Duplicated |
| 12 de Octubre | 12 De Octubre | Capital Letters |
| Argentinos Juniors | Argentinos juniors | Capital Letters |
| Jippo | JIPPO | Capital Letters |

table 3.1.1 1064 x 3
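As a minimal sketch of the exact and case-insensitive matching described above, the snippet below uses a few names from Table 3.1.1. The vectors `team` and `spbo` are illustrative stand-ins for the real columns, not the original data frames.

```r
# Illustrative names taken from Table 3.1.1 (not the full dataset).
team <- c("Aachen", "12 de Octubre", "Argentinos Juniors", "Jippo")
spbo <- c("JIPPO", "Aachen", "12 De Octubre", "Argentinos juniors")

# Exact comparison is case-sensitive: only "Aachen" is found.
exact <- team %in% spbo

# Lower-casing both sides first also matches capital-letter variants.
idx <- match(tolower(team), tolower(spbo))

# Label each row the way the table does; unmatched names would stay NA.
pass <- ifelse(exact, "Duplicated",
               ifelse(is.na(idx), NA, "Capital Letters"))
result <- data.frame(team = team, spbo = spbo[idx], pass = pass)
print(result)
```

match() returns positions in the second vector, which makes it convenient for pulling the spbo spelling alongside the firm A spelling.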

3.2 Apply amatch() and stringdist()

One concern is that a second team’s name is normally identical to the first team’s name, with only a suffix such as II or Reserve added. A number in the name does not always indicate this, for example : Mainz 05 is a first team, not a fifth reserve team. The more soccer match data is scraped, the more accurate the matching will be. For example, if we scrape only one day of data, how can we match the first team if, say, only the Chelsea reserve team played on that particular date?

Another concern is that a first team may be named TSV 1860 Munchen while its second or U19 teams are named 1860 Munchen II, 1860 Munchen U19, etc. Similarly, the team name Lincoln is supposed to be matched with Lincoln City and not Lincoln United, yet by string distance Lincoln City is closer to a string such as Lincoln Xxitxx than it is to Lincoln.

Besides, if I give priority to matching the kick-off date and only then the team names, postponed staked matches become a concern: a match may be postponed after firm A has placed its bets (firm A sometimes bets on the early market), or the kick-off date may be changed or postponed unexpectedly before kick-off due to snow, downpour, etc.

I load the stringdist package and apply its approximate matching function amatch() to the team names. The available methods are:
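The Lincoln concern can be checked with a small sketch. It uses base R’s adist(), which computes the Levenshtein distance, as a stand-in for the stringdist methods; the string Lincoln Xxitxx is the made-up name from the paragraph above.

```r
# adist() is base R's generalized Levenshtein distance.
# Turning "Lincoln City" into "Lincoln" needs 5 deletions (" City").
d_short <- as.vector(adist("Lincoln City", "Lincoln"))

# Turning "Lincoln City" into "Lincoln Xxitxx" needs only 4 edits
# (2 substitutions and 2 insertions), so the nonsense name is "closer".
d_noise <- as.vector(adist("Lincoln City", "Lincoln Xxitxx"))

print(c(short = d_short, noise = d_noise))
```

This is why a plain edit distance on the full name is not enough on its own to link Lincoln with Lincoln City.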

    1. osa - Optimal string alignment (restricted Damerau-Levenshtein distance).
    2. lv - Levenshtein distance (as in R’s native adist).
    3. dl - Full Damerau-Levenshtein distance.
    4. hamming - Hamming distance (a and b must have the same number of characters).
    5. lcs - Longest common substring distance.
    6. qgram - q-gram distance.
    7. cosine - cosine distance between q-gram profiles.
    8. jaccard - Jaccard distance between q-gram profiles.
    9. jw - Jaro, or Jaro-Winkler distance.
    10. soundex - Distance based on soundex encoding.

Let’s take an example.

[1] “Lincoln City”

table 3.2.1 10 x 12

I simply match the keyword Lincoln against the Home and Away team names obtained from firm A.

table 3.2.2 10 x 12

From the two tables above, I apply stringdist with maxDist set to the default value 0.1 as well as 0.5, 1.0, 2.0 and Inf, and select all available methods (the 10 methods listed above). I do not pretend to know how the stringdist() algorithms match the strings, so I try both the unique team names and all elements (without filtering to unique values).
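The effect of maxDist can be sketched as follows. This is only an illustration, assuming the stringdist package is installed (the code is skipped otherwise); the `candidates` vector is hypothetical, and only the lv method is shown rather than all ten.

```r
# Hypothetical candidate names; the real run uses the spbo team names.
candidates <- c("Lincoln City", "Lincoln United", "Lincoln Red Imps")

if (requireNamespace("stringdist", quietly = TRUE)) {
  # Levenshtein distance from "Lincoln" to each candidate.
  print(stringdist::stringdist("Lincoln", candidates, method = "lv"))

  # amatch() returns the index of the closest candidate whose distance
  # is within maxDist, or NA when no candidate is close enough.
  for (max_dist in c(0.1, 0.5, 1, 2, Inf)) {
    hit <- stringdist::amatch("Lincoln", candidates,
                              method = "lv", maxDist = max_dist)
    cat("maxDist =", max_dist, "->", candidates[hit], "\n")
  }
}
```

With small maxDist values nothing is returned for Lincoln, because even the nearest candidate is several edits away; only a large enough maxDist lets amatch() return a candidate.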

3.3 Apply agrep()

I tried simply applying the agrep() function to partially match the team names.

Table 3.3.1 : Simply apply agrep().
| Matching1 | team1 | spbo1 | Matching2 | team2 | spbo2 |
|---|---|---|---|---|---|
| Lincoln | Lincoln City | Lincoln | Lincoln City | Lincoln City | NA |
| Lincoln | NA | Lincoln Red Imps | Lincoln City | NA | NA |
| Lincoln | NA | Lincoln Reserve | Lincoln City | NA | NA |
| Lincoln | NA | Lincoln United | Lincoln City | NA | NA |
| Lincoln | NA | Lincoln Women | Lincoln City | NA | NA |
| Lincoln | NA | Rivadavia Lincoln | Lincoln City | NA | NA |

table 3.3.1 6 x 6
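A rough sketch of what agrep() does with the two patterns in Table 3.3.1 is given below. The `spbo` vector is an illustrative subset (Chelsea is added as a control), and the default max.distance is used.

```r
# Illustrative spbo names from Table 3.3.1, plus one unrelated control.
spbo <- c("Lincoln", "Lincoln Red Imps", "Lincoln Reserve",
          "Lincoln United", "Lincoln Women", "Rivadavia Lincoln",
          "Chelsea")

# agrep() looks for the pattern approximately *inside* each string,
# so the short pattern hits every name that contains "Lincoln".
short_hits <- agrep("Lincoln", spbo, value = TRUE)

# The longer pattern cannot be found inside the shorter name "Lincoln".
long_hits <- agrep("Lincoln City", spbo, value = TRUE)

print(short_hits)
print(long_hits)
```

The asymmetry matters: a short keyword over-matches (reserve and women’s teams included), while the full firm A name can miss the shorter livescore spelling.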

3.4 Apply partialMatch()

Secondly, there is an article, Merging Data Sets Based on Partially Matched Data Elements, which uses subsetting to match names partially. I apply the same approach to the team names.

The table below displays a few matched team names that are not accurate.

Table 3.4.2 : Inaccuracy of Matching Result.
| teamID | spboID | Match |
|---|---|---|
| AaB Aalborg | AaB Aalborg U17 | Partial |
| Airdrie United | Airdrie United Women | Partial |
| AS Trencin | AS Trencin U19 | Partial |
| Gremio Barueri | Gremio Barueri SP U20 | Partial |
| Sheffield United | Chesterfield United Women | Partial |
| Estudiantes Tecos | Estudiantes Tecos U20 | Partial |

table 3.4.2 1156 x 3

From the table above, we can see that the team AaB Aalborg from firm A is matched with AaB Aalborg U17 from the livescore website, and Airdrie United with Airdrie United Women. These are totally different teams, and such mismatches would lead a researcher to calculate wrong predictive figures for investment.

In order to maximize the number of soccer matches (observations) available for the research, I separate the matching into a few steps, using split() and cross-matching to rearrange the data before starting the algorithmic matching function in section 4 Reprocess the Data.
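One possible way to flag this kind of mismatch is sketched below. It is an assumption of mine and not part of the original functions: `has_squad_suffix()` is a hypothetical helper, and its list of qualifiers (U17-style age groups, II, Women, Reserve) is only a starting point.

```r
# Hypothetical helper (not from the original analysis): TRUE when a
# name ends with a squad qualifier such as U17, II, Women or Reserve(s).
has_squad_suffix <- function(x) {
  grepl("[[:space:]](U[0-9]{2}|II|Women|Reserves?)$", x)
}

team_id <- c("AaB Aalborg", "Airdrie United", "Mainz 05")
spbo_id <- c("AaB Aalborg U17", "Airdrie United Women", "Mainz 05")

# A pair is suspicious when only the livescore name carries a qualifier.
suspicious <- has_squad_suffix(spbo_id) & !has_squad_suffix(team_id)
print(data.frame(team_id, spbo_id, suspicious))
```

Note that Mainz 05 is not flagged, since a bare number is not treated as a squad qualifier.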

4. Reprocess the Data

4.1 Decision Tree

I would like to plot a hierarchical chart of how the team names are split for agrep. However, the rpart and randomForest packages require numeric data, and the diagram package offers nothing special for this purpose, so here I plot two dynamic graphs instead.

Since the simpleNetwork() function only applies to a two-column dataset, I split the chart into 2 graphs.

4.2 Filtering and Reprocess the Data

Prior to starting the algorithmic string matching, I use the signature() idea that Merging Data Sets Based on Partially Matched Data Elements applies to country names, in order to reduce some of the minor differences between strings. In this case: convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces. For example, United Kingdom becomes kingdomunited. This reduces the string distance between variants of the same name and so improves the matching result.

Here I split the team names into a list and simply apply grep and agrep as a first filter.
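The three signature steps can be sketched in base R as follows. The function name `name_signature()` is my own choice for illustration (it avoids masking methods::signature) and is not the code from the cited article.

```r
# Hypothetical helper: lower-case, sort the words, concatenate.
name_signature <- function(x) {
  words <- strsplit(tolower(trimws(x)), "[[:space:]]+")
  vapply(words,
         function(w) paste(sort(w), collapse = ""),
         character(1))
}

print(name_signature(c("United Kingdom", "12 De Octubre", "12 de Octubre")))
```

Names that differ only in capitalisation or word order end up with the same signature, so they can be matched exactly before any fuzzy method is tried.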

4.3 StringDist Maximum Likelihood

There is a good example in How can I match fuzzy match strings from two datasets?, which applies expand.grid() to build a data frame of candidate pairs and then follows an Expectation Maximization idea by running stringdist() in a while loop.

From the above table, I have matched the team names in Section 2 Dataset of Betting Strategy and Model Validation. Here I apply method = osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, soundex inside the stringdist function. Feel free to apply the function to scrape and rearrange team names and soccer score data for your own odds price modelling.
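The general shape of that approach might look like the sketch below. It is a simplified greedy loop under my own assumptions, not the code from the cited answer or from stringDistList(): base R’s adist() stands in for stringdist(), the names are illustrative, and the distance cap of 5 is arbitrary.

```r
team <- c("Aachen", "Jippo", "Lincoln")
spbo <- c("JIPPO", "Aachen", "Lincoln City", "Lincoln United")

# All candidate pairs; expand.grid() varies `team` fastest, which is
# the same order as the column-major adist() matrix (rows = team).
pairs <- expand.grid(team = team, spbo = spbo, stringsAsFactors = FALSE)
pairs$dist <- as.vector(adist(team, spbo, ignore.case = TRUE))

matched  <- pairs[0, ]   # empty frame with the same columns
max_dist <- 0
# Accept the closest pairs first, then relax the threshold step by step.
while (nrow(pairs) > 0 && max_dist <= 5) {
  hit <- pairs[pairs$dist <= max_dist, ]
  hit <- hit[order(hit$dist), ]
  hit <- hit[!duplicated(hit$team), ]   # one spbo name per team
  hit <- hit[!duplicated(hit$spbo), ]   # one team per spbo name
  matched <- rbind(matched, hit)
  # Remove names that are now taken from the remaining candidates.
  pairs <- pairs[!(pairs$team %in% hit$team) &
                 !(pairs$spbo %in% hit$spbo), ]
  max_dist <- max_dist + 1
}
print(matched)
```

Matching the exact and near-exact names first keeps them from being claimed by a weaker candidate later, which is the main point of relaxing the threshold gradually.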

5. Result

5.1 Checking and Filtering the Team Names

Here I manually checked the team names and compiled them into a file, in order to measure the accuracy of stringDist() against it.

5.2 Comparison of the Model

Firstly, we filter the team names.

Secondly, we compare the accuracy and the number of teams.

Table 5.2.2 : Summary of Matching Result 1
| match | rate | n |
|---|---|---|
| spbo | 1.0000000 | 1395 |
| osa | 0.9333333 | 1302 |
| lv | 0.9333333 | 1302 |
| dl | 0.9333333 | 1302 |
| hamming | 0.9333333 | 1302 |
| lcs | 0.9333333 | 1302 |
| qgram | 0.9333333 | 1302 |
| cosine | 0.9333333 | 1302 |
| jaccard | 0.9333333 | 1302 |
| jw | 0.9333333 | 1302 |
| soundex | 0.9333333 | 1302 |

As above, we filter the results of the partialMatch() function.

Here we also summarize the results in a table.

Table 5.2.4 : Summary of Matching Result 2
| match | rate | n |
|---|---|---|
| spbo | 1.0000000 | 1156 |
| PartialMatch | 0.9801038 | 1133 |

Based on the two functions above, the modified stringdist(), namely stringDistList(), correctly matched 1302 of 1395 teams, while partialMatch() correctly matched 1133 of 1156 teams. The more teams are matched correctly, the more information is gathered, which diversifies the investment opportunities across different leagues.
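The rates in the two summary tables are simply the number of correctly matched teams divided by the number of teams checked, as this short calculation shows (the vector names are mine):

```r
# Counts taken from Tables 5.2.2 and 5.2.4.
n_matched <- c(stringDistList = 1302, partialMatch = 1133)
n_total   <- c(stringDistList = 1395, partialMatch = 1156)

rate <- n_matched / n_total
print(round(rate, 7))
```

Note that the two rates are measured on different numbers of teams (1395 and 1156), so they are not a like-for-like comparison.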

5.3 Future Works

Approximate matching would be more accurate if I applied multivariate matching on the kick-off time together with both the home team and the away team at once. I initially tried to match the team names using the kick-off time as a criterion, but the kick-off time sometimes changes unexpectedly a few hours prior to kick-off.

I will also write the functions up as a package so that they are easier to load and log.

6. Appendices

6.1 Documenting File Creation

It’s useful to record some information about how your file was created.

  • File creation date: 2015-10-29
  • R version 3.2.3 (2015-12-10)
  • R version (short form): 3.2.3
  • rmarkdown package version: 0.9.2
  • File version: 1.0.3
  • File latest updated date: 2016-02-19
  • Author Profile: ®γσ, Eng Lian Hu
  • GitHub: Source Code
  • Additional session information

[1] “2016-02-19 23:13:27 EST”

setting value
version R version 3.2.3 (2015-12-10)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
tz America/New_York
date 2016-02-19

sysname “Linux”
release “3.10.0-229.20.1.el7.x86_64”
version “#1 SMP Tue Nov 3 19:10:07 UTC 2015”
nodename “rstudio-scibrokes”
machine “x86_64”
login “unknown”
user “ryoeng”
effective_user “ryoeng”

6.2 Version Log

  • File version: 1.0.0
    • file created
    • Applied regular expressions to filter and approximately match the soccer team names.
  • File version: 1.0.1
    • Changed to use datatable from the DT package to make the tables dynamic.
  • File version: 1.0.2 - “2015-11-22 09:41:49 JST”
    • Added Blooper after retesting the code.
    • Added Section 5. Result, which compares the models.
  • File version: 1.0.3 - “2016-02-05 05:24:35 EST”

6.3 Speech and Blooper

There were quite a few errors when I knit the HTML:

  • Knitting always got stuck at 29% (it stopped responding, yet was treated as completed). I tried a couple of times, and sometimes it prompted different errors (upgrading the Droplet to a larger RAM size did not help). Eventually I applied rm() and gc() to remove objects after use and free the memory.

  • I needed to reload the package with suppressAll(library('networkD3')) in chunk decission-tree-A before applying the function simpleNetwork, even though I had already loaded it in chunk libs at the beginning of section 1. Otherwise that particular function could not be found.