Challenge

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name
Player’s State
Total Number of Points
Player’s Pre-Rating
Average Pre-Rating of Opponents

For the first player, the information would be:

Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

Notes

While there are many packages for dealing with tables, joins, regex, string parsing, and the like, I will attempt to use base R as much as possible. I will make an exception to use a package for the data frame manipulation, for which I will use data.table. I will also load the ggplot2 package at the end for some graphics.

Further, while the average opponent pre-tournament rating may not be an integer, the example shows an integer. Therefore, I will round the averages and cast them to integer. Lastly, I will do my best to explain the salient points of each section of the code in the text, and post the entire code and its output at the end.

File Structure

Once downloaded, the data file looks like this:

-----------------------------------------------------------------------------------------
 Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
 Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
-----------------------------------------------------------------------------------------
    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
-----------------------------------------------------------------------------------------
    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
-----------------------------------------------------------------------------------------
    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|
   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |
-----------------------------------------------------------------------------------------
    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|
   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |
-----------------------------------------------------------------------------------------

It has dedicated header rows and then three kinds of data rows:

A line of hyphens signifying a new record
The first record line containing PartnerID, Name, Points, Win/Loss and Partner
The second Record line containing State, Some other data (personal ID?), pre-rating ranking, post-rating ranking, and other “stuff”

A shame they don’t have a regular structure as FIDE does (pdf) but then this course would need a new project!

Data Extraction

Approach

There are two different ways to approach the data extraction and munging:

Import the data as-is into R and perform the munging in R
Write code that will traverse the file line-by-line and import the data more meaningfully

I’ve had some experience with non-regularly delimited files and R and I eventually switched to Python which is better for that kind of work. Here, the files are semi-regularly delimited in that there are only three kinds of rows (excluding the header) and the file is short, so I am going to attempt the second approach.

File Traversal

Having access to the file already, I can take advantage of knowing its structure. The plan is to read the file line-by-line (skipping the first 3 header lines) and then handle the two data lines appropriately. The file will be opened as a read-only text connection, and that will be passed to readLines for processing.

While it would be safer to pass through the file one time to get the number of lines and then a second time to process it, this file is lacking an EOF (or at least an EOL on the last line). Thus, while there are 196 lines, a system call like “wc” only returns 195, so a different way to traverse the file is needed.

The next obvious approach in R would be to use a while loop and set the end of file condition to be when the length of the read line returns 0. This is what readLines does when past the end of a file. However, while loops check at the start of the next iteration, and the last line not having an EOF or EOL marker would cause errors. A do_until loop would be optimal for this kind of loop, as the check condition occurs not at the start of the next iteration but the end of the current iteration. Alas, R has no do_until, but it does have repeat, next, and break, between which the flow can be properly monitored.

Tables

While it may be tempting to pass through the data and store it in one flat file/table, there are significant benefits to storing this data in two tables:

A Players Table
A Matches Table

Player’s opponents are not identified by name but by a numeric code. Players also do not necessarily play the same number of matches. If there is one table with one row per player with their ID and basic information, and another with each individual match and the participants, the tables can be joined as needed to calculate the average opponent rankings. While it is certainly more space efficient to store each match only once, it is also easier to store it twice. Not only does it allow for a simple traversal of the file which already has everything doubled, it also makes the join easier as each column reflects the tournament in its entirety.

Data Transform

Line Parsing

The readLines command reads in the entire line as one character string, so the first thing that needs to be done is to split the string at the delimiters, which are pipes (|). The strsplit command will do that, but fixed = TRUE must be passed, otherwise R will start looking for regex. Realizing that strsplit returns a list, and that it’s easier to deal with vectors, the strsplit call will be wrapped in unlist. Lastly, there is a slew of white space at either end of many of these values, so to have a consistent size and starting location for each sub-string, the compound call will be further wrapped by trimws.

Process First Data Row

Player Data

So once readLines has imported a data line (not a line of hyphens), it will be processed by the call trimws(unlist(strsplit(CurLine, "|", fixed = TRUE)), which = "both"). If it is line 1, the player’s ID, name, and points are the first three elements of the unlisted and trimmed character vector, and those will be assigned to temporary variables. These will be loaded into the Players table after the second row is processed.

Match Data

The remainder of the first row is the match data. Once trimmed, it always has the same format. It is always five characters long, it always starts with a letter, and it may end in a number or in blanks. The number is at most two digits, as there were only 64 contestants. Therefore, for the remaining entries in the vector, if the last two digits are extracted and converted to an integer, either the opponents player ID, or an NA if there was none, will be the result. These values will be appended to the Matches table, which is simply the list of pairs of players. Every match will appear twice, as the player in question is always first and the opponent is always second. This is handled by the following code:¹

for (i in seq_len(length(CurLine) - 3L)) {
  Matches <- rbind(Matches, sata.frame(P1 = CurID,
                         P2 = as.integer(substr(CurLine[i + 3], 4, 5))))
      }

In synopsis, starting at entry 4 of the parsed first row and traversing to the end, identify the opponent—if there was any. Then immediately update the Matches table through an rbind.

Process Second Data Row

In the second data row there is only player data to be extracted—specifically the player’s state and pre-tournament ranking. After processing the raw read line similarly to the first row, the first entry is the state which is assigned to a temporary variable.

The second entry of this line is itself a compound character string, but which has a consistent format. It starts with a nine-character string, a space, a solidus (/), another space, an R, a colon, a space, the pre-tournament ranking, some more spaces, an arrow, and then the post tournament ranking. Thus, splitting this entry using : as the delimiter, taking the second half, extracting the first four characters, trimming the white space, and converting to integer will return the value of the pre-tournament ranking. That sounds complicated, but the call isn’t:

as.integer(substr(unlist(strsplit(CurLine[2], split = ": ", fixed = TRUE))[2], 1, 4))

I wish markdown would make it easier to use color, but in the above call, from inside out (as all good programmers should)², the second entry of the full data line is itself split at :, it’s second entry’s first 4 characters are taken and turned into an integer. This value is itself stored in a temporary variable.

Now with all the easily accessible player data stored in temporary variables, these are added to the Players table using rbind. This process repeats until readLines returns a length of 0, meaning the file is ended, and the loop is exited via break to mimic a do_until structure.

Joins

The file has now been traversed, the connection closed, and two tables have been populated: Players and Matches. For efficiency, the Players table is indexed by the player ID. The remaining step in the challenge is to calculate for each player 1, what the average ranking of corresponding the player 2’s are, remembering that some pairs may have NA for player 2, and append that to the Players table. This will take a few steps, which will be chained in the actual code for efficiency.

The first step is to join the PreRank from the Players table for each player 2 in the Matches table to the player 1 ID of the matches table. This is done via the data.table call

Players[Matches, on = "P_ID == P2", .(P1, PreRank)]

which can be read as:

Left join Players to Matches where P_ID in Matches is the same as P2 in Players and return P1 from Players and P2’s PreRanking from Matches.

In SQL, this would be something like:

SELECT M.P1, P.PreRank
FROM 
Players as P
LEFT JOIN Matches as M
on P.P_ID = M.P2;

The next step would be to summarize that table to have one row for each player 1 and the average of the entries associated with it. If the two-column table just created were called X, then next call would be

X[, .(OppPreRank = as.integer(round(mean(PreRank, na.rm = TRUE), 0L))), keyby = P1]

which would be read as: > With X, group by P1 and take the average of all the PreRanks excluding NAs, round it, cast it to an integer, and then call that OppPreRank and then order it by P1.

You can figure out the SQL for that on your own :)

Since data.table allows chaining, the result of the first join does not have to be stored to a variable but can be piped directly to the second call which now explains most of the join:

Players[Matches, on = "P_ID == P2", .(P1, PreRank)][, .(OppPreRank = as.integer(round(
                        mean(PreRank, na.rm = TRUE), 0L))), keyby = P1]

In data.table syntax, chaining a close bracket to an open bracket like so ][ means use the result of the first call as the new table from which to process the second call.

The result of the chained call should be a table with 64 rows and two columns: one of player IDs and one of their average opponent pre-tournament ranking. What is left is to join this to the rest of the player data in the Players table. This time, an inner join will be used.³ If the result of our chained call was called X, this new call would be Player[X, nomatch = NULL]. Like last time, though, there is no need to explicitly store the intermediate results, and the complicated call above can be simply wrapped in Players[..., nomatch = NULL] which is what is in the code.

Data Load

At this point, the requested challenge is in memory, and what remains is to write it to a CSV file, which can be done by write.csv, but as data.table is already loaded, the more efficient fwrite can be used.

Code

library(data.table)
Players <- data.table(P_ID = integer(0), Name = character(0L),
                         State = character(0L), Points = double(0L),
                         PreRank = integer(0L))
Matches <- data.table(P1 = integer(0), P2 = integer(0))
TD <- file("./tournamentinfo.txt", open = "rt")
# Skip header lines
CurLine <- readLines(TD, 3L, warn = FALSE)
# Main Loop
repeat {
  CurLine <- readLines(TD, 1L, warn = FALSE)
  lenCL <- length(CurLine)
  # Check for end-of-file
  if (lenCL == 0L) {
    break
  }
  # Check for new record
  if (substr(CurLine, 1L, 1L) == "-") {
    recordline <- 1L
  } else {
    if (recordline == 1L) {
      CurLine <- trimws(unlist(strsplit(CurLine, "|", fixed = TRUE)),
                        which = "both")
      CurID <- as.integer(CurLine[1])
      CurName <- CurLine[2]
      CurPoints <- as.double(CurLine[3])
      for (i in seq_len(length(CurLine) - 3L)) {
        Matches <- rbind(Matches, 
              data.table(P1 = CurID,
                         P2 = as.integer(substr(CurLine[i + 3], 4, 5))))
      }
      recordline <- 2L
    } else {
      CurLine <- trimws(unlist(strsplit(CurLine, "|", fixed = TRUE)),
                        which = "both")
      CurState <- CurLine[1]
      CurPre <- as.integer(substr(unlist(
        strsplit(CurLine[2], split = ": ", fixed = TRUE))[2], 1, 4))
      Players <- rbind(Players, data.table(P_ID = CurID, Name = CurName,
                                           State = CurState, Points = CurPoints,
                                           PreRank = CurPre))
    }
  }
}
close(TD)
setkey(Players, P_ID)
ChessExtract <- Players[
                  Players[
                    Matches, on = "P_ID == P2", .(P1, PreRank)
                      ][, .(OppPreRank = as.integer(round(
                        mean(PreRank, na.rm = TRUE), 0L))), keyby = P1],
                  nomatch = NULL]
fwrite(ChessExtract, "./ChessExtract.csv")

ChessExtract

Display

With only 64 lines, the extract can be written to this document as well.

ChessExtract

##     P_ID                       Name State Points PreRank OppPreRank
##  1:    1                   GARY HUA    ON    6.0    1794       1605
##  2:    2            DAKSHESH DARURI    MI    6.0    1553       1469
##  3:    3               ADITYA BAJAJ    MI    6.0    1384       1564
##  4:    4        PATRICK H SCHILLING    MI    5.5    1716       1574
##  5:    5                 HANSHI ZUO    MI    5.5    1655       1501
##  6:    6                HANSEN SONG    OH    5.0    1686       1519
##  7:    7          GARY DEE SWATHELL    MI    5.0    1649       1372
##  8:    8           EZEKIEL HOUGHTON    MI    5.0    1641       1468
##  9:    9                STEFANO LEE    ON    5.0    1411       1523
## 10:   10                  ANVIT RAO    MI    5.0    1365       1554
## 11:   11   CAMERON WILLIAM MC LEMAN    MI    4.5    1712       1468
## 12:   12             KENNETH J TACK    MI    4.5    1663       1506
## 13:   13          TORRANCE HENRY JR    MI    4.5    1666       1498
## 14:   14               BRADLEY SHAW    MI    4.5    1610       1515
## 15:   15     ZACHARY JAMES HOUGHTON    MI    4.5    1220       1484
## 16:   16               MIKE NIKITIN    MI    4.0    1604       1386
## 17:   17         RONALD GRZEGORCZYK    MI    4.0    1629       1499
## 18:   18              DAVID SUNDEEN    MI    4.0    1600       1480
## 19:   19               DIPANKAR ROY    MI    4.0    1564       1426
## 20:   20                JASON ZHENG    MI    4.0    1595       1411
## 21:   21              DINH DANG BUI    ON    4.0    1563       1470
## 22:   22           EUGENE L MCCLURE    MI    4.0    1555       1300
## 23:   23                   ALAN BUI    ON    4.0    1363       1214
## 24:   24          MICHAEL R ALDRICH    MI    4.0    1229       1357
## 25:   25           LOREN SCHWIEBERT    MI    3.5    1745       1363
## 26:   26                    MAX ZHU    ON    3.5    1579       1507
## 27:   27             GAURAV GIDWANI    MI    3.5    1552       1222
## 28:   28 SOFIA ADINA STANESCU-BELLU    MI    3.5    1507       1522
## 29:   29           CHIEDOZIE OKORIE    MI    3.5    1602       1314
## 30:   30         GEORGE AVERY JONES    ON    3.5    1522       1144
## 31:   31               RISHI SHETTY    MI    3.5    1494       1260
## 32:   32      JOSHUA PHILIP MATHEWS    ON    3.5    1441       1379
## 33:   33                    JADE GE    MI    3.5    1449       1277
## 34:   34     MICHAEL JEFFERY THOMAS    MI    3.5    1399       1375
## 35:   35           JOSHUA DAVID LEE    MI    3.5    1438       1150
## 36:   36              SIDDHARTH JHA    MI    3.5    1355       1388
## 37:   37       AMIYATOSH PWNANANDAM    MI    3.5     980       1385
## 38:   38                  BRIAN LIU    MI    3.0    1423       1539
## 39:   39              JOEL R HENDON    MI    3.0    1436       1430
## 40:   40               FOREST ZHANG    MI    3.0    1348       1391
## 41:   41        KYLE WILLIAM MURPHY    MI    3.0    1403       1248
## 42:   42                   JARED GE    MI    3.0    1332       1150
## 43:   43          ROBERT GLEN VASEY    MI    3.0    1283       1107
## 44:   44         JUSTIN D SCHILLING    MI    3.0    1199       1327
## 45:   45                  DEREK YAN    MI    3.0    1242       1152
## 46:   46   JACOB ALEXANDER LAVALLEY    MI    3.0     377       1358
## 47:   47                ERIC WRIGHT    MI    2.5    1362       1392
## 48:   48               DANIEL KHAIN    MI    2.5    1382       1356
## 49:   49           MICHAEL J MARTIN    MI    2.5    1291       1286
## 50:   50                 SHIVAM JHA    MI    2.5    1056       1296
## 51:   51             TEJAS AYYAGARI    MI    2.5    1011       1356
## 52:   52                  ETHAN GUO    MI    2.5     935       1495
## 53:   53              JOSE C YBARRA    MI    2.0    1393       1345
## 54:   54                LARRY HODGE    MI    2.0    1270       1206
## 55:   55                  ALEX KONG    MI    2.0    1186       1406
## 56:   56               MARISA RICCI    MI    2.0    1153       1414
## 57:   57                 MICHAEL LU    MI    2.0    1092       1363
## 58:   58               VIRAJ MOHILE    MI    2.0     917       1391
## 59:   59          SEAN M MC CORMICK    MI    2.0     853       1319
## 60:   60                 JULIA SHEN    MI    1.5     967       1330
## 61:   61              JEZZEL FARKAS    ON    1.5     955       1327
## 62:   62              ASHWIN BALAJI    MI    1.0    1530       1186
## 63:   63       THOMAS JOSEPH HOSMER    MI    1.0    1175       1350
## 64:   64                     BEN LI    MI    1.0    1163       1263
##     P_ID                       Name State Points PreRank OppPreRank

Analysis

Once the ETL process is complete, the analysis can begin. Some questions which may now be addressed include:

Who had the lowest pre-tournament average?
Who had the lowest opponent pre-tournament average?
Who had the highest pre-tournament average?
Who had the highest opponent pre-tournament average?
Is there a relationship between a competitor’s pre-tournament ranking and the average of their opponents?
Does a linear or smoothed relationship make more sense?

ChessExtract[which.min(ChessExtract$PreRank)]

##    P_ID                     Name State Points PreRank OppPreRank
## 1:   46 JACOB ALEXANDER LAVALLEY    MI      3     377       1358

ChessExtract[which.min(ChessExtract$OppPreRank)]

##    P_ID              Name State Points PreRank OppPreRank
## 1:   43 ROBERT GLEN VASEY    MI      3    1283       1107

ChessExtract[which.max(ChessExtract$PreRank)]

##    P_ID     Name State Points PreRank OppPreRank
## 1:    1 GARY HUA    ON      6    1794       1605

ChessExtract[which.max(ChessExtract$OppPreRank)]

##    P_ID     Name State Points PreRank OppPreRank
## 1:    1 GARY HUA    ON      6    1794       1605

cor(ChessExtract$PreRank, ChessExtract$OppPreRank)

## [1] 0.2839375

library(ggplot2)
ggplot(ChessExtract, aes(x = PreRank, y = OppPreRank)) + geom_point() + geom_smooth(method = "lm") + geom_smooth(method = "loess", color = 'red')

SessionInfo

sessionInfo()

## R version 3.6.1 Patched (2019-08-04 r76915)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.2.1     data.table_1.12.2
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2       knitr_1.24       magrittr_1.5     tidyselect_0.2.5
##  [5] munsell_0.5.0    colorspace_1.4-1 R6_2.4.0         rlang_0.4.0     
##  [9] stringr_1.4.0    dplyr_0.8.3      tools_3.6.1      grid_3.6.1      
## [13] gtable_0.3.0     xfun_0.9         withr_2.1.2      htmltools_0.3.6 
## [17] assertthat_0.2.1 yaml_2.2.0       lazyeval_0.2.2   digest_0.6.20   
## [21] tibble_2.1.3     crayon_1.3.4     purrr_0.3.2      glue_1.3.1      
## [25] evaluate_0.14    rmarkdown_1.15   labeling_0.3     stringi_1.4.3   
## [29] compiler_3.6.1   pillar_1.4.2     scales_1.0.0     pkgconfig_2.0.2

See Patrick Burn’s R Inferno (pdf) for why the seq_len(x) or seq_along(x) commands should always be used for looping and not the 1:length(x) formulation.↩
I’m sorry, I cannot help myself.↩
While a left join would be safe, as the two tables are known to have the same number of rows and the same index, an inner join is safer here.↩

CUNY MSDS DATA 607 Project 1

Avraham Adler

9/8/2019