Natural Language Analysis

Abstract

This is a natural language analysis on matching soccer team names, carried out as part of my research on Betting Strategy and Model Validation. The functions written here are meant to make it easier to scrape team names for further calculation in future and so reduce my workload.

1. Natural Language

Bookmakers operate independently, so they often use different names for the same team. Quite a lot of hedging applications apply regular expressions to match data scraped from websites, and some use real-time API connections in order to hedge odds prices among bookmakers on the spot or to provide real-time information services to clients.

Here I scrape and match the team names with R.

2. Read and Process the Dataset

Read the dataset of worldwide soccer matches from 2011 to 2015, provided by a British betting consultancy referred to as firm A.

table 2.1 48744 x 20

Because the dataset is very large (48744 x 20), the web page kept loading and could not be opened, so here I show only a few rows of the data frame.

Read the dataset of worldwide soccer matches from 2011 to 2015, scraped from the spbo livescore website.

table 2.2 5798529 x 20

Because the livescore dataset (5798529 x 20) contains a lot of unrelated soccer matches, here I show only a few rows of the data frame for viewing purposes.

3.1 Matching Duplicated Teams’ Name

To match strings, we can first apply match() or %in% to the teams’ names. R treats names that differ only in capitalization as different strings, so I apply tolower() before matching, since such names refer to exactly the same team in real life.

Table 3.1.1 : Exact matches and capital-letter differences.
team	spbo	pass
Aachen	Aachen	Duplicated
Aalesund	Aalesund	Duplicated
Aarau	Aarau	Duplicated
12 de Octubre	12 De Octubre	Capital Letters
Argentinos Juniors	Argentinos juniors	Capital Letters
Jippo	JIPPO	Capital Letters

table 3.1.1 1064 x 3
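A minimal sketch of the exact and case-insensitive matching described in this section. The short vectors are illustrative samples (three pairs from Table 3.1.1 plus one unmatched name), not the full datasets, and the variable names team, spbo, exact, idx and pass are assumptions.

```r
# Illustrative samples only; the real vectors come from firm A and spbo.
team <- c("Aachen", "12 de Octubre", "Jippo", "Lincoln")
spbo <- c("Aachen", "12 De Octubre", "JIPPO", "Lincoln City")

# Exact match: identical strings, including capitalization.
exact <- team %in% spbo

# Case-insensitive match: compare lower-cased copies of both vectors.
idx <- match(tolower(team), tolower(spbo))

# Label each firm A name by how (or whether) it was matched.
pass <- ifelse(exact, "Duplicated",
               ifelse(!is.na(idx), "Capital Letters", NA_character_))

data.frame(team = team, spbo = spbo[idx], pass = pass)
```

Names such as Lincoln that stay NA after this step are the ones left for the approximate matching in the following sections.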

3.2 Apply amatch() and stringdist()

One concern is that a second team’s name is normally identical to the first team’s name with only a suffix such as II or Reserves added. For example, Mainz 05 is a first team, not a fifth reserve team. The more soccer matches are scraped, the more accurate the matching will be: if we scrape only one day of data, how can we match the first team when, say, only the Chelsea reserve team plays on that date?

Another concern is that a first team may be named TSV 1860 Munchen while its second and U19 teams are termed 1860 Munchen II, 1860 Munchen U19, etc. The team name Lincoln is supposed to be matched with Lincoln City and not Lincoln United, yet Lincoln City is a closer approximate match to Lincoln Xxitxx than to Lincoln.

Besides, if I prioritize matching the kick-off date before the team names, postponed staked matches become a concern: a match may be postponed after firm A has placed its bets (firm A sometimes bets on the early market), or the kick-off date may be changed or postponed shortly before kick-off due to snow, downpour, etc.

I load the stringdist package and apply its approximate matching function amatch() to the team names. The available methods are:

1. osa - Optimal string alignment (restricted Damerau-Levenshtein distance).
2. lv - Levenshtein distance (as in R’s native adist).
3. dl - Full Damerau-Levenshtein distance.
4. hamming - Hamming distance (a and b must have the same number of characters).
5. lcs - Longest common substring distance.
6. qgram - q-gram distance.
7. cosine - Cosine distance between q-gram profiles.
8. jaccard - Jaccard distance between q-gram profiles.
9. jw - Jaro, or Jaro-Winkler distance.
10. soundex - Distance based on soundex encoding.
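To get a feel for how the methods listed above differ, the following sketch computes the distance between one pair of names under each method. It assumes the stringdist package is installed (the block is skipped otherwise); the pair of names and the variable names dist_methods and dists are chosen for illustration only.

```r
# Assumes the stringdist package is installed; skipped otherwise.
if (requireNamespace("stringdist", quietly = TRUE)) {
  dist_methods <- c("osa", "lv", "dl", "hamming", "lcs", "qgram",
                    "cosine", "jaccard", "jw", "soundex")

  # Distance between one pair of names under each method.
  dists <- sapply(dist_methods, function(m) {
    stringdist::stringdist("Lincoln", "Lincoln City", method = m)
  })
  print(dists)
}
```

The edit-based methods count the five inserted characters of " City", while hamming returns Inf because the two names have different lengths, so the scales are not directly comparable across methods.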

Let’s take an example.

[1] “Lincoln City”

table 3.2.1 10 x 12

I simply match the keyword Lincoln against the home and away team names in the data from firm A.

table 3.2.2 10 x 12

For the two tables above, I apply stringdist with maxDist set to the default value 0.1 as well as 0.5, 1.0, 2.0 and Inf, and with all ten available methods listed earlier in this section. I do not pretend to know how the stringdist() algorithms match the strings internally, so I try both the unique teams’ names and all elements (without filtering to unique values).
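A hedged sketch of what trying several maxDist values looks like for a single method. It assumes the stringdist package is installed (the block is skipped otherwise); the candidate names are a small illustrative sample and the variable names candidates, max_dists and hits are assumptions.

```r
# Assumes the stringdist package is installed; skipped otherwise.
if (requireNamespace("stringdist", quietly = TRUE)) {
  candidates <- c("Lincoln City", "Lincoln United", "Lincoln Red Imps")
  max_dists  <- c(0.1, 0.5, 1.0, 2.0, Inf)

  # Index of the closest candidate within each maxDist, NA when none qualifies.
  hits <- vapply(max_dists, function(d) {
    stringdist::amatch("Lincoln", candidates, method = "lv", maxDist = d)
  }, integer(1))

  print(data.frame(maxDist = max_dists, matched = candidates[hits]))
}
```

With the Levenshtein method the nearest candidate is five edits away, so only the unrestricted maxDist returns a match here. Methods scaled to the range 0 to 1, such as jw, react very differently to the same thresholds.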

3.3 Apply agrep()

I simply applied the agrep() function to partially match the teams’ names.

Table 3.3.1 : Simply apply agrep().
Matching1	team1	spbo1	Matching2	team2	spbo2
Lincoln	Lincoln City	Lincoln	Lincoln City	Lincoln City	NA
Lincoln	NA	Lincoln Red Imps	Lincoln City	NA	NA
Lincoln	NA	Lincoln Reserve	Lincoln City	NA	NA
Lincoln	NA	Lincoln United	Lincoln City	NA	NA
Lincoln	NA	Lincoln Women	Lincoln City	NA	NA
Lincoln	NA	Rivadavia Lincoln	Lincoln City	NA	NA

table 3.3.1 6 x 6
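The spbo1 column of Table 3.3.1 can be reproduced in miniature with base R. This is only a sketch: the spbo vector holds the six names from the table plus Chelsea as an unrelated control, and lincoln_hits is a hypothetical variable name.

```r
spbo <- c("Lincoln", "Lincoln Red Imps", "Lincoln Reserve", "Lincoln United",
          "Lincoln Women", "Rivadavia Lincoln", "Chelsea")

# agrep() returns elements that contain an approximate match of the pattern,
# so every name containing "Lincoln" is returned, wanted or not.
lincoln_hits <- agrep("Lincoln", spbo, value = TRUE)
lincoln_hits
```

Because the pattern only has to appear somewhere inside the candidate, reserve, women and unrelated clubs all come back together and still need to be filtered.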

3.4 Apply partialMatch()

Secondly, there is an article, Merging Data Sets Based on Partially Matched Data Elements, which applies subsetting to partially match the teams’ names.

The table below displays a few of the matched teams’ names that are not accurate.

Table 3.4.2 : Inaccuracy of Matching Result.
teamID	spboID	Match
AaB Aalborg	AaB Aalborg U17	Partial
Airdrie United	Airdrie United Women	Partial
AS Trencin	AS Trencin U19	Partial
Gremio Barueri	Gremio Barueri SP U20	Partial
Sheffield United	Chesterfield United Women	Partial
Estudiantes Tecos	Estudiantes Tecos U20	Partial

table 3.4.2 1156 x 3

The table above shows that the team AaB Aalborg from firm A is matched with AaB Aalborg U17 from the livescore website, and Airdrie United with Airdrie United Women. These are totally different teams, and such matches would lead a researcher to calculate wrong predictive figures for investment.
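A small sketch of why plain partial matching produces these pairs. It uses grepl() with fixed = TRUE as a stand-in for the partial matching step, not the article's partialMatch() itself, and the suffix pattern is a hypothetical guard that is not part of the original analysis.

```r
# Three pairs taken from Table 3.4.2.
team_id <- c("AaB Aalborg", "Airdrie United", "AS Trencin")
spbo_id <- c("AaB Aalborg U17", "Airdrie United Women", "AS Trencin U19")

# Substring test: is each firm A name contained in the paired spbo name?
partial <- unname(mapply(function(p, x) grepl(p, x, fixed = TRUE),
                         team_id, spbo_id))

# Hypothetical guard: flag spbo names that end in a youth or women suffix.
suspicious <- grepl("( U[0-9]{2}| Women)$", spbo_id)

data.frame(teamID = team_id, spboID = spbo_id,
           partial = partial, suspicious = suspicious)
```

Every pair passes the substring test, and every pair is also flagged by the suffix check, which suggests such suffixes should be handled before a partial match is accepted.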

To maximize the number of soccer matches (observations) available for the research, I break the matching into a few steps: I use split() on the teams’ names and cross-match the pieces against each other to rearrange the data before starting the algorithmic matching in section 4, Reprocess the Data.

4. Reprocess the Data

4.1 Decision Tree
4.2 Filtering and Reprocess the Data
4.3 StringDist Maximum Likelihood

4.1 Decision Tree

I would like to plot a hierarchical chart for splitting the teams’ names for agrep. However, the rpart and randomForest packages require numeric data, and the diagram package offers nothing special for this purpose, so here I plot two dynamic graphs.

Since the simpleNetwork() function only applies to a two-column dataset, I split the data into two graphs.

4.2 Filtering and Reprocess the Data

Before starting the algorithmic string matching, I borrow the signature() idea, applied to country names in Merging Data Sets Based on Partially Matched Data Elements, to reduce some of the minor differences between strings: convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces. For example, United Kingdom becomes kingdomunited. This reduces the string distance between equivalent names and so improves the matching result.
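The signature idea is short enough to sketch in base R. The helper below is a hypothetical re-implementation written from the description above, not the article's own code, and it is named name_signature() to avoid masking methods::signature().

```r
# Hypothetical helper: lower-case, sort the words, concatenate without spaces.
name_signature <- function(x) {
  words <- strsplit(tolower(x), "[[:space:]]+")
  vapply(words, function(w) paste(sort(w), collapse = ""), character(1))
}

name_signature(c("United Kingdom", "Kingdom United", "Lincoln City"))
```

Names that differ only in word order or capitalization collapse to the same signature, so they can be matched exactly before any distance is computed.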

Here I split the teams’ names into a list and simply apply grep and agrep as a first filter.

4.3 StringDist Maximum Likelihood

There is a good example in How can I match fuzzy match strings from two datasets?, which applies expand.grid() to build a data frame of candidate pairs and then follows the idea of Expectation Maximization by running a while loop over stringdist().
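The following is a rough sketch of that approach, not the referenced answer's exact code. It reads the while loop as a simple greedy assignment, and it swaps stringdist() for base R adist() (Levenshtein distance) so that it runs without extra packages. The sample names and the variable names pairs, best and matched are assumptions.

```r
team <- c("Lincoln City", "Aachen", "Jippo")
spbo <- c("JIPPO", "Lincoln", "Aachen", "Lincoln United")

# All candidate pairs, one row per (team, spbo) combination.
pairs <- expand.grid(team = team, spbo = spbo, stringsAsFactors = FALSE)

# Levenshtein distance on lower-cased names, via base R adist().
pairs$dist <- mapply(function(a, b) adist(a, b)[1, 1],
                     tolower(pairs$team), tolower(pairs$spbo),
                     USE.NAMES = FALSE)

# Greedy assignment: repeatedly accept the closest remaining pair, then
# drop every other pair that reuses either of its names.
matched <- pairs[0, ]
while (nrow(pairs) > 0) {
  best    <- pairs[which.min(pairs$dist), ]
  matched <- rbind(matched, best)
  pairs   <- pairs[pairs$team != best$team & pairs$spbo != best$spbo, ]
}
matched
```

On this toy input the loop pairs Lincoln City with Lincoln United (distance 4) instead of Lincoln (distance 5), the kind of mismatch raised in section 3.2, so the output still needs manual checking.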

In the table above, I have matched the teams’ names from Section 2 Dataset of Betting Strategy and Model Validation. Here I apply method = osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw and soundex inside the stringdist function. Feel free to apply the function to scrape and rearrange the teams’ names and soccer score data for your own odds price modelling.

5. Result

5.1 Checked and Filtered the Teams’ Name
5.2 Comparison of the Model
5.3 Future Works

5.1 Checked and Filtered the Teams’ Name

Here I manually checked the teams’ names and compiled them into a file, against which the accuracy of stringDist() can be compared.

5.2 Comparison of the Model

Firstly, we filter the teams’ names.

Secondly, we compare the accuracy and the number of teams.

Table 5.2.2 : Summary of Matching Result 1
	match	rate	n
spbo	spbo	1.0000000	1395
osa	osa	0.9333333	1302
lv	lv	0.9333333	1302
dl	dl	0.9333333	1302
hamming	hamming	0.9333333	1302
lcs	lcs	0.9333333	1302
qgram	qgram	0.9333333	1302
cosine	cosine	0.9333333	1302
jaccard	jaccard	0.9333333	1302
jw	jw	0.9333333	1302
soundex	soundex	0.9333333	1302

As above, we filter the results of the PartialMatch function.

Here we also summarize the results in a table.

Table 5.2.4 : Summary of Matching Result 2
	match	rate	n
spbo	spbo	1.0000000	1156
PartialMatch	PartialMatch	0.9801038	1133

From the two summaries above, the modified stringdist() function, stringDistList(), correctly matched 1302 of 1395 teams, while partialMatch() matched 1133 of 1156 teams. The more teams are correctly matched, the more information is available to diversify investment opportunities across different leagues.
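The rate column in both summary tables is simply the matched count divided by the number of manually checked teams. A quick check with the counts from Tables 5.2.2 and 5.2.4 (the data frame layout and column names are assumptions):

```r
# Counts taken from Tables 5.2.2 and 5.2.4.
result <- data.frame(
  match   = c("stringDistList", "partialMatch"),
  matched = c(1302, 1133),
  total   = c(1395, 1156)
)

# Share of checked teams that each function matched correctly.
result$rate <- result$matched / result$total
result
```

Note that the two functions were scored against different totals (1395 versus 1156 teams), so the rates are not measured on the same set of teams.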

5.3 Future Works

Approximate matching would be more accurate if I applied multivariate matching on the kick-off time together with both the home team and the away team at once. I initially tried to match the teams’ names using kick-off time as a criterion, but the kick-off time sometimes changes unexpectedly a few hours before kick-off.

I will also write the functions as a package so that they are easier to load and log.

6. Appendices

6.1 Documenting File Creation
6.2 Versions’ Log
6.3 Speech and Blooper
6.4 References

6.1 Documenting File Creation

It’s useful to record some information about how your file was created.

File creation date: 2015-10-29
R version 3.2.3 (2015-12-10)
R version (short form): 3.2.3
rmarkdown package version: 0.9.2
File version: 1.0.3
File latest updated date: 2016-02-19
Author Profile: ®γσ, Eng Lian Hu
GitHub: Source Code
Additional session information

[1] “2016-02-19 23:13:27 EST”

setting value
version R version 3.2.3 (2015-12-10)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
tz America/New_York
date 2016-02-19

sysname “Linux”
release “3.10.0-229.20.1.el7.x86_64”
version “#1 SMP Tue Nov 3 19:10:07 UTC 2015”
nodename “rstudio-scibrokes”
machine “x86_64”
login “unknown”
user “ryoeng”
effective_user “ryoeng”

6.2 Versions’ Log

File version: 1.0.0
- file created
- Applied regular expressions to filter and approximately match the soccer team names.
File version: 1.0.1
- Changed to use datatable from the DT package to make the tables dynamic.
File version: 1.0.2 - “2015-11-22 09:41:49 JST”
- Added Blooper after retesting the code.
- Added Section 5. Result, which compares the models.
File version: 1.0.3 - “2016-02-05 05:24:35 EST”
- Modified datatable so that the tables can be saved as xls/csv files, by referring to How to set multiple option list and extensions in DT::datatable.
- Added a log file for version upgrades.

6.3 Speech and Blooper

I ran into quite a few errors when knitting to HTML:

Knitting often got stuck at 29% (unresponsive, yet treated as completed). I tried a couple of times and sometimes got different errors; upgrading the Droplet to a larger RAM size did not help. Eventually I applied rm() and gc() to remove objects after use and free the memory.
I need to reload the package with suppressAll(library('networkD3')) in chunk decission-tree-A before applying the function simpleNetwork, even though I load it in chunk libs at the beginning of section 1. Otherwise that particular function cannot be found.