Stringr will be used for parsing text. DT will be used to display nice data tables. Sqldf will be used to perform SQL commands easily on dataframes.
RawTournamentInfo = readChar("tournamentinfo.txt",file.info("tournamentinfo.txt")$size)
library(stringr)
library(DT)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
Each row is separated by many dashes followed by a newline character. Since readChar interprets each character literally, carriage returns and new lines are stated in plain text, which makes them much easier to find. The code below separates each “row” by finding a long group of dashes followed by a carriage return and newline character. Empty strings are then removed from the output.
RowSplit = unlist(str_split(RawTournamentInfo, "-+\\\r\\\n"))
RowSplit = RowSplit[RowSplit != ""]
head(RowSplit)
## [1] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| \r\n Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | \r\n"
## [2] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|\r\n ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |\r\n"
## [3] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|\r\n MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |\r\n"
## [4] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|\r\n MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |\r\n"
## [5] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|\r\n MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |\r\n"
## [6] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|\r\n MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |\r\n"
In the code below, I split each “row” into a list of individual rows separated by a carriage return and a newline. This list of rows format keeps each player’s data as their own element in the list, while parsing each newline into individual rows. Already, the data is looking much cleaner from when I originally started.
ListRows = str_split(RowSplit,"\\\r\\\n")
head(ListRows)
## [[1]]
## [1] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [2] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [3] ""
##
## [[2]]
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [3] ""
##
## [[3]]
## [1] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [2] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [3] ""
##
## [[4]]
## [1] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [2] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [3] ""
##
## [[5]]
## [1] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [2] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [3] ""
##
## [[6]]
## [1] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
## [2] " MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
## [3] ""
Since the original data contains different information about the same player in sets of two rows, I separated the top row from the bottom row in order to parse the data further. The code below takes the first element of each item in the list and stores it as a vector. The same thing is done for the second row.
TopRows = sapply(1:length(ListRows), function(x) ListRows[[x]][1] )
BottomRows = sapply(1:length(ListRows), function(x) ListRows[[x]][2] )
head(TopRows)
## [1] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [2] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [3] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [4] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [5] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [6] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
head(BottomRows)
## [1] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [2] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [3] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [4] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [5] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [6] " MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
Now that the rows are neatly organized and all the newline characters are removed, I can finally parse each row by the pipe separator.
TopRowsList = str_split(TopRows[2:length(TopRows)],"\\|")
BottomRowsList = str_split(BottomRows[2:length(BottomRows)],"\\|")
TopRowsList[1:2]
## [[1]]
## [1] " 1 "
## [2] " GARY HUA "
## [3] "6.0 "
## [4] "W 39"
## [5] "W 21"
## [6] "W 18"
## [7] "W 14"
## [8] "W 7"
## [9] "D 12"
## [10] "D 4"
## [11] ""
##
## [[2]]
## [1] " 2 "
## [2] " DAKSHESH DARURI "
## [3] "6.0 "
## [4] "W 63"
## [5] "W 58"
## [6] "L 4"
## [7] "W 17"
## [8] "W 16"
## [9] "W 20"
## [10] "W 7"
## [11] ""
BottomRowsList[1:2]
## [[1]]
## [1] " ON "
## [2] " 15445895 / R: 1794 ->1817 "
## [3] "N:2 "
## [4] "W "
## [5] "B "
## [6] "W "
## [7] "B "
## [8] "W "
## [9] "B "
## [10] "W "
## [11] ""
##
## [[2]]
## [1] " MI "
## [2] " 14598900 / R: 1553 ->1663 "
## [3] "N:2 "
## [4] "B "
## [5] "W "
## [6] "B "
## [7] "W "
## [8] "B "
## [9] "W "
## [10] "B "
## [11] ""
Here, I create a function that fetches a specified element of each item in a list. For example, if I wanted the first element of each item in a list, I would write GetListElement(ListName,1). The function also trims the leading and trailing whitespace for each element, which makes text processing much easier.
GetListElement = function(list,element){
output = sapply(1:length(list), function(x) trimws(list[[x]][element],which = "both"))
return(output)
}
The function below parses the opponent IDs. I wrote this function becauase the GetListElement doesn’t parse the opponent IDs properly. The function matches the last non space group of characters in each string. This can either be a number, which corresponds to an opponent, or a letter, which means that the person did not have an opponent for a particular round.
GetOpponentID = function(OpponentVector){
output = unlist(str_extract_all(OpponentVector,"[^[:space:]]+$"))
return(output)
}
The following code uses the functions specified above to fetch individual data elements from each row. For the most part, the data elements are nicely formatted and ready to be put into the final dataframe. However, the player pre rating still needs to be processed so that I can calculate the average opponent rating.
PlayerID = as.numeric(GetListElement(TopRowsList,1))
PlayerName = GetListElement(TopRowsList,2)
TotalPoints = GetListElement(TopRowsList,3)
Opponent1 = GetOpponentID(GetListElement(TopRowsList,4))
Opponent2 = GetOpponentID(GetListElement(TopRowsList,5))
Opponent3 = GetOpponentID(GetListElement(TopRowsList,6))
Opponent4 = GetOpponentID(GetListElement(TopRowsList,7))
Opponent5 = GetOpponentID(GetListElement(TopRowsList,8))
Opponent6 = GetOpponentID(GetListElement(TopRowsList,9))
Opponent7 = GetOpponentID(GetListElement(TopRowsList,10))
PlayerState = GetListElement(BottomRowsList,1)
PlayerPreRating = GetListElement(BottomRowsList,2)
Player pre rating follows a consistent pattern, which makes it relatively painless to parse. Each rating starts with the capital letter “R” followed by a semicolon and then a series of numbers. The code below matches that pattern and then further extracts just the number. Since the pattern only fetches the numbers following the “R:”, provisonal ratings are not matched.
PlayerPreRating = unlist(str_extract_all(PlayerPreRating,"R:[[:space:]]*[[:digit:]]+"))
PlayerPreRating = as.numeric(unlist(str_extract_all(PlayerPreRating,"[[:digit:]]+")))
head(PlayerPreRating)
## [1] 1794 1553 1384 1716 1655 1686
The following code organizes player data by name, ID, and pre rating. This table will be used to calculate the average opponent player rating.
PlayerData = data.frame(PlayerName,PlayerID,PlayerPreRating,stringsAsFactors = F)
DT::datatable(PlayerData,options = list(pageLength = 10))
The following code creates a one-to-many table relating player IDs to opponent IDs. The code also removes matches where the player did not compete in a match against another player (these are the na’s created through coercion). This table will be used to calculate the average opponent rating for each player.
PlayerOpponent = na.omit(cbind.data.frame(PlayerID = rep(PlayerID,7),Opponent = as.numeric(c(Opponent1,
Opponent2,
Opponent3,
Opponent4,
Opponent5,
Opponent6,
Opponent7))))
## Warning in data.frame(..., check.names = FALSE): NAs introduced by coercion
DT::datatable(PlayerOpponent,options = list(pageLength = 10))
The following code uses sqldf to use sql commands to join tables together. This package is very convenient for SQL minded folk who do not like R’s merge() function (i.e. me). The code below is a simple left join that relates the PlayerOpponent table with the PlayerData table.
PlayerOpponent = sqldf("SELECT
PO.PlayerID,
PO.Opponent,
PD.PlayerPreRating
FROM
PlayerOpponent as PO
LEFT JOIN
PlayerData as PD on
PO.Opponent = PD.PlayerID")
DT::datatable(PlayerOpponent,options = list(pageLength = 10))
Using sqldf again, I group players by their player IDs and calculate the average rating for all the opponents they faced. This table will be used in the final output table.
AverageOpponentRating = sqldf("SELECT
PlayerID,
avg(PlayerPreRating) as AverageOpponentRating
FROM
PlayerOpponent
GROUP BY
PlayerID")
DT::datatable(AverageOpponentRating,options = list(pageLength = 10))
The code below creates the final output dataframe and displays the result in a neat little table.
FinalTable = data.frame(PlayerName,
PlayerState,
TotalPoints = as.numeric(TotalPoints),
PlayerPreRating,
AverageOpponentRating = AverageOpponentRating$AverageOpponentRating,
stringsAsFactors = F)
DT::datatable(FinalTable,options = list(pageLength = 10))
The following code exports the final dataframe to a csv file named “TournamentInfoReformatted.csv”. I set row.names to false because it would print the row index as an additional column next to the actual data.
write.csv(FinalTable, file = "TournamentInfoReformatted.csv", row.names = F)