This is a natural-language analysis of matching soccer team names, done as part of my research on Betting Strategy and Model Validation. I wrote these functions to make it easier to scrape and match team names for later calculations, and so reduce my workload.
As we know, different bookmakers use different team names, since they operate independently. Quite a few hedging applications use regular expressions to match data scraped from websites, while others use real-time API connections to hedge odds prices among bookmakers on the spot or to provide real-time information services to clients.
Today I am trying to scrape and match the team names with R.
Read the dataset of worldwide soccer matches from 2011 to 2015 from a British betting consultancy, referred to here as firm A.
table 2.1 48744 x 20
Because the dataset is very large (48744 x 20), the webpage kept loading and would not open, so here I show only a few rows of the data frame.
Read the dataset of worldwide soccer matches from 2011 to 2015, scraped from the spbo livescore website.
table 2.2 5798529 x 20
Because the livescore dataset (5798529 x 20) contains a lot of unrelated soccer matches, here I show only a few rows of the data frame for viewing purposes.
To match a string, we can first apply `match()` or `%in%` to the team names. Strings that differ only in capitalization are not duplicates in R, so I apply `tolower()` before matching, since in real life they are considered exactly the same team name.
team | spbo | pass |
---|---|---|
Aachen | Aachen | Duplicated |
Aalesund | Aalesund | Duplicated |
Aarau | Aarau | Duplicated |
12 de Octubre | 12 De Octubre | Capital Letters |
Argentinos Juniors | Argentinos juniors | Capital Letters |
Jippo | JIPPO | Capital Letters |
table 3.1.1 1064 x 3
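A minimal sketch of the exact and case-insensitive matching described above. The sample vectors are hypothetical stand-ins for the firm A and spbo name lists, and the `pass` labels simply mirror the table.

```r
# Hypothetical sample names; the real lists come from firm A and spbo.
team <- c("Aachen", "12 de Octubre", "Jippo", "Lincoln")
spbo <- c("Aachen", "12 De Octubre", "JIPPO", "Lincoln City")

# Exact match: case matters, so only identical strings are found.
exact <- team %in% spbo

# Case-insensitive match: lower-case both sides before calling match().
idx <- match(tolower(team), tolower(spbo))

result <- data.frame(
  team = team,
  spbo = spbo[idx],
  pass = ifelse(exact, "Duplicated",
                ifelse(!is.na(idx), "Capital Letters", NA_character_)),
  stringsAsFactors = FALSE
)
print(result)
```

Names that stay `NA` here (such as `Lincoln`) are the ones left over for approximate matching.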
One concern is that a second team's name is normally identical to the first team's name with a suffix such as II or Reserves added, but a trailing number is not always such a suffix: for example, Mainz 05 is a first team, not a fifth reserve team. The more match data is scraped, the more accurate the matching becomes: if we scraped only one day of data, how could we match the first team if, say, only the Chelsea reserve team played on that date?
However, another concern is that the first team may be named TSV 1860 Munchen while its second/U19 teams are named 1860 Munchen II, 1860 Munchen U19, etc. The team name Lincoln is supposed to match Lincoln City rather than Lincoln United, yet under approximate matching Lincoln City is closer to a name of the form Lincoln Xxitxx than to Lincoln.
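The Lincoln concern can be illustrated with edit distances. This is a sketch only: it uses base R's `adist()` (generalized Levenshtein distance) as a stand-in for the `stringdist` methods, and the candidate list is illustrative.

```r
# Distance from the short name "Lincoln" to some longer candidates.
candidates <- c("Lincoln City", "Lincoln United", "Lincoln Red Imps")
d_short <- drop(adist("Lincoln", candidates))
names(d_short) <- candidates
print(d_short)

# "Lincoln City" is closer to a similar-length name than to plain "Lincoln".
d_city <- drop(adist("Lincoln City", c("Lincoln", "Lincoln Xxitxx")))
print(d_city)
```

Pure edit distance rewards similar length and shared letters, which is why the shortest distance is not always the right club.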
Besides, if I match on kick-off date first and team names second, postponed matches become a concern: a match may be postponed after firm A has placed its bets (firm A sometimes bets on the Early market), or the kick-off date may be changed unexpectedly before kick-off because of snow, downpour, etc.
I load the `stringdist` package and apply its approximate matching function `amatch()` to the team names.
Let’s take an example below.
[1] “Lincoln City”
table 3.2.1 10 x 12
I simply match the keyword `Lincoln` in the Home and Away team names from firm A's data.
table 3.2.2 10 x 12
From the two tables above, I apply stringdist with `maxDist` set to 0.1 (the default), 0.5, 1.0, 2.0 and Inf, and select all available methods (the 10 methods listed in section 3 before the code was run). I don't pretend to know how the algorithms behind `stringdist()` match strings, so I try both the unique team names and all elements (without filtering to unique values).
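A minimal sketch of how `maxDist` changes the result of `amatch()`, assuming the `stringdist` package is installed. The candidate names and the subset of methods are illustrative, not the full grid used in the study.

```r
library(stringdist)

# Hypothetical spbo candidates for the firm A name "Lincoln".
spbo <- c("Lincoln City", "Lincoln United", "Lincoln Red Imps")

max_dists <- c(0.1, 0.5, 1, 2, Inf)
methods <- c("osa", "lv", "dl", "lcs", "jw")

# For every method and threshold, return the index of the closest
# candidate within maxDist, or NA when nothing is close enough.
res <- sapply(methods, function(m) {
  sapply(max_dists, function(d) {
    amatch("Lincoln", spbo, method = m, maxDist = d)
  })
})
rownames(res) <- max_dists
print(res)
```

Edit-based methods need a threshold of several edits before `Lincoln` reaches `Lincoln City`, while methods scaled to [0, 1] such as `jw` react to much smaller thresholds, so one `maxDist` does not mean the same thing across methods.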
I also tried applying the `agrep()` function to partially match the team names.
Matching1 | team1 | spbo1 | Matching2 | team2 | spbo2 |
---|---|---|---|---|---|
Lincoln | Lincoln City | Lincoln | Lincoln City | Lincoln City | NA |
Lincoln | NA | Lincoln Red Imps | Lincoln City | NA | NA |
Lincoln | NA | Lincoln Reserve | Lincoln City | NA | NA |
Lincoln | NA | Lincoln United | Lincoln City | NA | NA |
Lincoln | NA | Lincoln Women | Lincoln City | NA | NA |
Lincoln | NA | Rivadavia Lincoln | Lincoln City | NA | NA |
table 3.3.1 6 x 6
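The behaviour in the table can be reproduced roughly with base R's `agrep()`. This is a sketch on a hand-typed candidate list (the `Chelsea` entry is added only as a negative control), using the default `max.distance`.

```r
# Hypothetical spbo names taken from the table above, plus one unrelated team.
spbo <- c("Lincoln", "Lincoln Red Imps", "Lincoln Reserve", "Lincoln United",
          "Lincoln Women", "Rivadavia Lincoln", "Chelsea")

# A short pattern is approximately contained in many names.
hits_short <- agrep("Lincoln", spbo, value = TRUE)
print(hits_short)

# The full name is too far from every candidate to be matched.
hits_long <- agrep("Lincoln City", spbo, value = TRUE)
print(hits_long)
```

The short pattern over-matches and the long pattern under-matches, which is the trade-off the table shows.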
Secondly, the article Merging Data Sets Based on Partially Matched Data Elements applies subsetting to partially match the team names.
The table below shows a few of the matched team names that are not accurate.
teamID | spboID | Match |
---|---|---|
AaB Aalborg | AaB Aalborg U17 | Partial |
Airdrie United | Airdrie United Women | Partial |
AS Trencin | AS Trencin U19 | Partial |
Gremio Barueri | Gremio Barueri SP U20 | Partial |
Sheffield United | Chesterfield United Women | Partial |
Estudiantes Tecos | Estudiantes Tecos U20 | Partial |
table 3.4.2 1156 x 3
From the table above, we can see that the team `AaB Aalborg` from firm A is matched with `AaB Aalborg U17` from the livescore website, and `Airdrie United` is matched with `Airdrie United Women`. These are totally different teams, and such matches would lead a researcher to calculate wrong predictive figures for investment.
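A small sketch of why naive partial matching goes wrong here, together with one possible guard. The `squad_marker` pattern is a hypothetical filter introduced only for illustration; it is not part of the referenced article or of the functions used in this study.

```r
team <- c("AaB Aalborg", "Airdrie United")
spbo <- c("AaB Aalborg U17", "Airdrie United Women", "AaB Aalborg")

# Naive partial match: take the first spbo name containing the firm A name.
partial <- sapply(team, function(nm) spbo[grepl(nm, spbo, fixed = TRUE)][1])
print(partial)

# Hypothetical guard: skip candidates that carry a youth/women/reserve marker.
squad_marker <- "\\b(U[0-9]{2}|Women|Reserves?|II)\\b"
is_other_squad <- grepl(squad_marker, spbo, perl = TRUE)

guarded <- sapply(team, function(nm) {
  hits <- spbo[grepl(nm, spbo, fixed = TRUE) & !is_other_squad]
  if (length(hits) > 0) hits[1] else NA_character_
})
print(guarded)
```

With the guard, a team with no first-team counterpart stays unmatched (`NA`) instead of being paired with the wrong squad.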
To maximize the number of soccer matches (observations) available for the research, I separate the matching into a few steps, using `split()` and cross-matching the pieces against each other to rearrange the data before starting the algorithmic matching function in section 4 Reprocess the Data.
I would like to plot a hierarchical chart of the split team names for `agrep`. However, the `rpart` and `randomForest` packages require numeric data, whereas a diagram does not, so here I plot two dynamic graphs instead.
Since the `simpleNetwork()` function only accepts a two-column dataset, I split the plot into two graphs.
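A sketch of the two-column shape that `simpleNetwork()` expects: each row is one link from a full team name to one of its words. The team names are illustrative, and the plot itself is only drawn if `networkD3` is installed.

```r
teams <- c("Lincoln City", "Lincoln United")

# Split each name into words, then build one row per (name, word) link.
words <- strsplit(teams, " ", fixed = TRUE)
edges <- data.frame(
  source = rep(teams, lengths(words)),
  target = unlist(words),
  stringsAsFactors = FALSE
)
print(edges)

# simpleNetwork() uses the first two columns as source and target by default.
if (requireNamespace("networkD3", quietly = TRUE)) {
  graph <- networkD3::simpleNetwork(edges)
}
```

Shared words such as `Lincoln` become hub nodes, which makes groups of similarly named teams easy to spot.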
Before starting the algorithmic string matching, I use the signature() idea from Merging Data Sets Based on Partially Matched Data Elements, where it is applied to country names, to reduce some of the minor differences between strings: convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces. For example, United Kingdom becomes `kingdomunited`. This reduces the string distance between equivalent names and so improves the matching result.
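The three steps can be sketched as a small helper. The name `name_signature()` is hypothetical (chosen to avoid clashing with `methods::signature()`), and this is a re-implementation of the idea rather than the code from the referenced article.

```r
# Lower-case, split on whitespace, sort the words, and paste them together.
name_signature <- function(x) {
  vapply(strsplit(tolower(x), "[[:space:]]+"), function(w) {
    paste(sort(w[nzchar(w)]), collapse = "")
  }, character(1))
}

print(name_signature("United Kingdom"))
print(name_signature(c("Argentinos Juniors", "Juniors  argentinos")))
```

Names that differ only in case, word order or spacing collapse to the same signature, so they can be matched exactly before any fuzzy step.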
Here I split the team names into a list and simply apply `grep` and `agrep` as a first filter.
There is a good example in How can I match fuzzy match strings from two datasets?, which applies `expand.grid()` to build a data frame of candidate pairs and then uses Expectation Maximization theory via a while loop on `stringdist()`.
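A simplified sketch of the pairing loop. Note the assumptions: this is a greedy one-to-one assignment rather than the Expectation Maximization procedure of the referenced post, base R's `adist()` stands in for `stringdist()`, and `max_dist` and the sample names are illustrative.

```r
team <- c("Lincoln City", "Argentinos Juniors", "Jippo")
spbo <- c("JIPPO", "Argentinos juniors", "Lincoln City FC", "Chelsea")

# All candidate pairs and their case-insensitive edit distance.
pairs <- expand.grid(team = team, spbo = spbo, stringsAsFactors = FALSE)
pairs$dist <- mapply(function(a, b) adist(tolower(a), tolower(b))[1, 1],
                     pairs$team, pairs$spbo, USE.NAMES = FALSE)

# Repeatedly accept the closest remaining pair, then drop both names
# from the pool so that each name is used at most once.
max_dist <- 3
matched <- pairs[0, ]
while (nrow(pairs) > 0 && min(pairs$dist) <= max_dist) {
  best <- pairs[which.min(pairs$dist), ]
  matched <- rbind(matched, best)
  pairs <- pairs[pairs$team != best$team & pairs$spbo != best$spbo, ]
}
print(matched)
```

The cutoff keeps leftover names such as `Chelsea` from being forced onto an unrelated team once the good pairs are used up.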
From the table above, I have matched the team names from Section 2 Dataset of Betting Strategy and Model Validation. Here I apply method = osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, soundex in the `stringdist` function. Feel free to apply the function to scrape and rearrange the team names and soccer score data for your own odds price modelling.
Here I manually checked the team names and compiled them into a file to compare the accuracy of stringDist().
Firstly, we filter the team names.
Secondly, we compare the accuracy and the number of teams.
method | match | rate | n |
---|---|---|---|
spbo | spbo | 1.0000000 | 1395 |
osa | osa | 0.9333333 | 1302 |
lv | lv | 0.9333333 | 1302 |
dl | dl | 0.9333333 | 1302 |
hamming | hamming | 0.9333333 | 1302 |
lcs | lcs | 0.9333333 | 1302 |
qgram | qgram | 0.9333333 | 1302 |
cosine | cosine | 0.9333333 | 1302 |
jaccard | jaccard | 0.9333333 | 1302 |
jw | jw | 0.9333333 | 1302 |
soundex | soundex | 0.9333333 | 1302 |
As above, we filter the results of the `PartialMatch` function.
Here we also summarize the results in a table.
method | match | rate | n |
---|---|---|---|
spbo | spbo | 1.0000000 | 1156 |
PartialMatch | PartialMatch | 0.9801038 | 1133 |
Based on the two functions above, we know that the modified `stringdist()`, which is `stringDistList()`, correctly gathered 1302 of 1395 teams, while `partialMatch()` matched 1133 of 1156 teams. The more teams are matched correctly, the more information is available to diversify the investment opportunity across different leagues.
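The rates in the two tables are consistent with the number of correctly matched teams divided by the number of manually checked teams. A short check, assuming that is how the `rate` column was computed (the column names below are illustrative):

```r
summary_tbl <- data.frame(
  match = c("stringDistList", "partialMatch"),
  n_matched = c(1302, 1133),
  n_total = c(1395, 1156),
  stringsAsFactors = FALSE
)

# Share of manually verified teams that each function matched correctly.
summary_tbl$rate <- summary_tbl$n_matched / summary_tbl$n_total
print(summary_tbl)
```

The two rates are measured against different totals (1395 and 1156 teams), so they are not directly comparable as a ranking of the two functions.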
Approximate matching would be more accurate if I applied multivariate matching on kick-off time together with both the home team and the away team at once. I initially tried to match the team names using kick-off time as a criterion, but the kick-off time sometimes changes unexpectedly a few hours before kick-off.
I will also bundle these functions into a package to make them easier to load and log.
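A sketch of matching the home and away names jointly, leaving kick-off time out because it can change. Base R's `adist()` is used as an illustrative distance, the fixtures are made up, and summing the two distances is just one possible way to combine them.

```r
firm_a <- data.frame(home = "Lincoln", away = "Chelsea",
                     stringsAsFactors = FALSE)
spbo <- data.frame(
  home = c("Lincoln United", "Lincoln City", "Lincoln City"),
  away = c("Arsenal", "Chelsea FC", "Everton"),
  stringsAsFactors = FALSE
)

# Distance matrices: rows are firm A fixtures, columns are spbo fixtures.
d_home <- adist(tolower(firm_a$home), tolower(spbo$home))
d_away <- adist(tolower(firm_a$away), tolower(spbo$away))

# Score each candidate fixture on both names at once.
d_total <- d_home + d_away
best <- apply(d_total, 1, which.min)

matched <- data.frame(firm_a_home = firm_a$home, firm_a_away = firm_a$away,
                      spbo_home = spbo$home[best], spbo_away = spbo$away[best],
                      stringsAsFactors = FALSE)
print(matched)
```

The home name alone ties between the two `Lincoln City` fixtures; adding the away name breaks the tie.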
It’s useful to record some information about how your file was created.
rmarkdown package version: 0.9.2

[1] “2016-02-19 23:13:27 EST”

setting | value |
---|---|
version | R version 3.2.3 (2015-12-10) |
system | x86_64, linux-gnu |
ui | X11 |
language | (EN) |
collate | en_US.UTF-8 |
tz | America/New_York |
date | 2016-02-19 |

field | value |
---|---|
sysname | “Linux” |
release | “3.10.0-229.20.1.el7.x86_64” |
version | “#1 SMP Tue Nov 3 19:10:07 UTC 2015” |
nodename | “rstudio-scibrokes” |
machine | “x86_64” |
login | “unknown” |
user | “ryoeng” |
effective_user | “ryoeng” |

There were quite a few errors when I knit to HTML: for example, knitting always got stuck at 29% (not responding, yet treated as completed). I tried a couple of times and was sometimes prompted with different errors (upgrading the Droplet to more RAM did not help), and eventually applied `rm()` and `gc()` to remove objects after use and free the memory.
I need to reload the package with `suppressAll(library('networkD3'))` in chunk `decission-tree-A` before applying the function `simpleNetwork`, even though I already load it in chunk `libs` at the beginning of section 1. Otherwise that particular function cannot be found.
Powered by - Copyright© Intellectual Property Rights of Scibrokes®個人の経営企業