Data 607 - Project 1, Chess Tournament Data Formatting

Import raw data and necessary packages

Stringr will be used for parsing text. DT will be used to display nice data tables. Sqldf will be used to perform SQL commands easily on dataframes.

RawTournamentInfo = readChar("tournamentinfo.txt",file.info("tournamentinfo.txt")$size)

library(stringr)
library(DT)
library(sqldf)

## Loading required package: gsubfn

## Loading required package: proto

## Loading required package: RSQLite

Split data into “rows”

Each row is separated by many dashes followed by a newline character. Since readChar interprets each character literally, carriage returns and new lines are stated in plain text, which makes them much easier to find. The code below separates each “row” by finding a long group of dashes followed by a carriage return and newline character. Empty strings are then removed from the output.

RowSplit = unlist(str_split(RawTournamentInfo, "-+\\\r\\\n"))
RowSplit = RowSplit[RowSplit != ""]

head(RowSplit)

## [1] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| \r\n Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | \r\n"
## [2] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|\r\n   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |\r\n"  
## [3] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|\r\n   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |\r\n"  
## [4] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|\r\n   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |\r\n"  
## [5] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|\r\n   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |\r\n"  
## [6] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|\r\n   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |\r\n"

Split individual player data into rows

In the code below, I split each “row” into a list of individual rows separated by a carriage return and a newline. This list of rows format keeps each player’s data as their own element in the list, while parsing each newline into individual rows. Already, the data is looking much cleaner from when I originally started.

ListRows = str_split(RowSplit,"\\\r\\\n")

head(ListRows)

## [[1]]
## [1] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [2] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [3] ""                                                                                          
## 
## [[2]]
## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3] ""                                                                                         
## 
## [[3]]
## [1] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [2] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [3] ""                                                                                         
## 
## [[4]]
## [1] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
## [2] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3] ""                                                                                         
## 
## [[5]]
## [1] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
## [2] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [3] ""                                                                                         
## 
## [[6]]
## [1] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
## [2] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [3] ""

Separate top row from bottom row

Since the original data contains different information about the same player in sets of two rows, I separated the top row from the bottom row in order to parse the data further. The code below takes the first element of each item in the list and stores it as a vector. The same thing is done for the second row.

TopRows = sapply(1:length(ListRows), function(x) ListRows[[x]][1] )
BottomRows = sapply(1:length(ListRows), function(x) ListRows[[x]][2] )

head(TopRows)

## [1] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [2] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [3] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
## [4] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
## [5] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|" 
## [6] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"

head(BottomRows)

## [1] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## [3] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [4] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## [5] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |" 
## [6] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"

Parse rows by pipe separator (“|”)

Now that the rows are neatly organized and all the newline characters are removed, I can finally parse each row by the pipe separator.

TopRowsList = str_split(TopRows[2:length(TopRows)],"\\|")
BottomRowsList = str_split(BottomRows[2:length(BottomRows)],"\\|")

TopRowsList[1:2]

## [[1]]
##  [1] "    1 "                           
##  [2] " GARY HUA                        "
##  [3] "6.0  "                            
##  [4] "W  39"                            
##  [5] "W  21"                            
##  [6] "W  18"                            
##  [7] "W  14"                            
##  [8] "W   7"                            
##  [9] "D  12"                            
## [10] "D   4"                            
## [11] ""                                 
## 
## [[2]]
##  [1] "    2 "                           
##  [2] " DAKSHESH DARURI                 "
##  [3] "6.0  "                            
##  [4] "W  63"                            
##  [5] "W  58"                            
##  [6] "L   4"                            
##  [7] "W  17"                            
##  [8] "W  16"                            
##  [9] "W  20"                            
## [10] "W   7"                            
## [11] ""

BottomRowsList[1:2]

## [[1]]
##  [1] "   ON "                           
##  [2] " 15445895 / R: 1794   ->1817     "
##  [3] "N:2  "                            
##  [4] "W    "                            
##  [5] "B    "                            
##  [6] "W    "                            
##  [7] "B    "                            
##  [8] "W    "                            
##  [9] "B    "                            
## [10] "W    "                            
## [11] ""                                 
## 
## [[2]]
##  [1] "   MI "                           
##  [2] " 14598900 / R: 1553   ->1663     "
##  [3] "N:2  "                            
##  [4] "B    "                            
##  [5] "W    "                            
##  [6] "B    "                            
##  [7] "W    "                            
##  [8] "B    "                            
##  [9] "W    "                            
## [10] "B    "                            
## [11] ""

Create function to fetch and clean text for items in the list

Here, I create a function that fetches a specified element of each item in a list. For example, if I wanted the first element of each item in a list, I would write GetListElement(ListName,1). The function also trims the leading and trailing whitespace for each element, which makes text processing much easier.

GetListElement = function(list,element){
  
  output = sapply(1:length(list), function(x) trimws(list[[x]][element],which = "both"))
  
  return(output)
  
}

Create function to fetch opponent IDs

The function below parses the opponent IDs. I wrote this function becauase the GetListElement doesn’t parse the opponent IDs properly. The function matches the last non space group of characters in each string. This can either be a number, which corresponds to an opponent, or a letter, which means that the person did not have an opponent for a particular round.

GetOpponentID = function(OpponentVector){
  
  output = unlist(str_extract_all(OpponentVector,"[^[:space:]]+$"))
  
  return(output)
}

Fetch individual data elements from each row

The following code uses the functions specified above to fetch individual data elements from each row. For the most part, the data elements are nicely formatted and ready to be put into the final dataframe. However, the player pre rating still needs to be processed so that I can calculate the average opponent rating.

PlayerID = as.numeric(GetListElement(TopRowsList,1))
PlayerName = GetListElement(TopRowsList,2)
TotalPoints = GetListElement(TopRowsList,3)

Opponent1 = GetOpponentID(GetListElement(TopRowsList,4))
Opponent2 = GetOpponentID(GetListElement(TopRowsList,5))
Opponent3 = GetOpponentID(GetListElement(TopRowsList,6))
Opponent4 = GetOpponentID(GetListElement(TopRowsList,7))
Opponent5 = GetOpponentID(GetListElement(TopRowsList,8))
Opponent6 = GetOpponentID(GetListElement(TopRowsList,9))
Opponent7 = GetOpponentID(GetListElement(TopRowsList,10))

PlayerState = GetListElement(BottomRowsList,1)
PlayerPreRating = GetListElement(BottomRowsList,2)

Parse player pre rating

Player pre rating follows a consistent pattern, which makes it relatively painless to parse. Each rating starts with the capital letter “R” followed by a semicolon and then a series of numbers. The code below matches that pattern and then further extracts just the number. Since the pattern only fetches the numbers following the “R:”, provisonal ratings are not matched.

PlayerPreRating = unlist(str_extract_all(PlayerPreRating,"R:[[:space:]]*[[:digit:]]+"))
PlayerPreRating = as.numeric(unlist(str_extract_all(PlayerPreRating,"[[:digit:]]+")))

head(PlayerPreRating)

## [1] 1794 1553 1384 1716 1655 1686

Create a table of player rating data

The following code organizes player data by name, ID, and pre rating. This table will be used to calculate the average opponent player rating.

PlayerData = data.frame(PlayerName,PlayerID,PlayerPreRating,stringsAsFactors = F)

DT::datatable(PlayerData,options = list(pageLength = 10))

Create a table for opponent data

The following code creates a one-to-many table relating player IDs to opponent IDs. The code also removes matches where the player did not compete in a match against another player (these are the na’s created through coercion). This table will be used to calculate the average opponent rating for each player.

PlayerOpponent = na.omit(cbind.data.frame(PlayerID = rep(PlayerID,7),Opponent = as.numeric(c(Opponent1,
                                                                    Opponent2,
                                                                    Opponent3,
                                                                    Opponent4,
                                                                    Opponent5,
                                                                    Opponent6,
                                                                    Opponent7))))

## Warning in data.frame(..., check.names = FALSE): NAs introduced by coercion

DT::datatable(PlayerOpponent,options = list(pageLength = 10))

Join PlayerOpponent table with PlayerData table

The following code uses sqldf to use sql commands to join tables together. This package is very convenient for SQL minded folk who do not like R’s merge() function (i.e. me). The code below is a simple left join that relates the PlayerOpponent table with the PlayerData table.

PlayerOpponent = sqldf("SELECT
                          PO.PlayerID,
                          PO.Opponent,
                          PD.PlayerPreRating
                        FROM
                          PlayerOpponent as PO
                        LEFT JOIN
                          PlayerData as PD on
                          PO.Opponent = PD.PlayerID")

DT::datatable(PlayerOpponent,options = list(pageLength = 10))

Calculate average opponent rating

Using sqldf again, I group players by their player IDs and calculate the average rating for all the opponents they faced. This table will be used in the final output table.

AverageOpponentRating = sqldf("SELECT
                                 PlayerID,
                                 avg(PlayerPreRating) as AverageOpponentRating
                               FROM
                                 PlayerOpponent
                               GROUP BY
                                 PlayerID")

DT::datatable(AverageOpponentRating,options = list(pageLength = 10))

Putting everything together

The code below creates the final output dataframe and displays the result in a neat little table.

FinalTable = data.frame(PlayerName,
                        PlayerState,
                        TotalPoints = as.numeric(TotalPoints),
                        PlayerPreRating,
                        AverageOpponentRating = AverageOpponentRating$AverageOpponentRating,
                        stringsAsFactors = F)

DT::datatable(FinalTable,options = list(pageLength = 10))

Exporting the data

The following code exports the final dataframe to a csv file named “TournamentInfoReformatted.csv”. I set row.names to false because it would print the row index as an additional column next to the actual data.

write.csv(FinalTable, file = "TournamentInfoReformatted.csv", row.names = F)