In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
For the first player, the information would be:
Gary Hua, ON, 6.0, 1794, 1605
1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.
While there are many packages for dealing with tables, joins, regex, string parsing, and the like, I will attempt to use base R as much as possible. I will make an exception to use a package for the data frame manipulation, for which I will use data.table. I will also load the ggplot2 package at the end for some graphics.
Further, while the average opponent pre-tournament rating may not be an integer, the example shows an integer. Therefore, I will round the averages and cast them to integer. Lastly, I will do my best to explain the salient points of each section of the code in the text, and post the entire code and its output at the end.
Once downloaded, the data file looks like this:
-----------------------------------------------------------------------------------------
Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
-----------------------------------------------------------------------------------------
2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
-----------------------------------------------------------------------------------------
3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |
-----------------------------------------------------------------------------------------
4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|
MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |
-----------------------------------------------------------------------------------------
It has dedicated header rows and then three kinds of data rows:
A shame they don’t have a regular structure as FIDE does (pdf) but then this course would need a new project!
There are two different ways to approach the data extraction and munging:
I’ve had some experience with non-regularly delimited files and R and I eventually switched to Python which is better for that kind of work. Here, the files are semi-regularly delimited in that there are only three kinds of rows (excluding the header) and the file is short, so I am going to attempt the second approach.
Having access to the file already, I can take advantage of knowing its structure. The plan is to read the file line-by-line (skipping the first 3 header lines) and then handle the two data lines appropriately. The file will be opened as a read-only text connection, and that will be passed to readLines for processing.
While it would be safer to pass through the file one time to get the number of lines and then a second time to process it, this file is lacking an EOF (or at least an EOL on the last line). Thus, while there are 196 lines, a system call like “wc” only returns 195, so a different way to traverse the file is needed.
The next obvious approach in R would be to use a while loop and set the end of file condition to be when the length of the read line returns 0. This is what readLines does when past the end of a file. However, while loops check at the start of the next iteration, and the last line not having an EOF or EOL marker would cause errors. A do_until loop would be optimal for this kind of loop, as the check condition occurs not at the start of the next iteration but the end of the current iteration. Alas, R has no do_until, but it does have repeat, next, and break, between which the flow can be properly monitored.
While it may be tempting to pass through the data and store it in one flat file/table, there are significant benefits to storing this data in two tables:
Player’s opponents are not identified by name but by a numeric code. Players also do not necessarily play the same number of matches. If there is one table with one row per player with their ID and basic information, and another with each individual match and the participants, the tables can be joined as needed to calculate the average opponent rankings. While it is certainly more space efficient to store each match only once, it is also easier to store it twice. Not only does it allow for a simple traversal of the file which already has everything doubled, it also makes the join easier as each column reflects the tournament in its entirety.
The readLines command reads in the entire line as one character string, so the first thing that needs to be done is to split the string at the delimiters, which are pipes (|). The strsplit command will do that, but fixed = TRUE must be passed, otherwise R will start looking for regex. Realizing that strsplit returns a list, and that it’s easier to deal with vectors, the strsplit call will be wrapped in unlist. Lastly, there is a slew of white space at either end of many of these values, so to have a consistent size and starting location for each sub-string, the compound call will be further wrapped by trimws.
So once readLines has imported a data line (not a line of hyphens), it will be processed by the call trimws(unlist(strsplit(CurLine, "|", fixed = TRUE)), which = "both"). If it is line 1, the player’s ID, name, and points are the first three elements of the unlisted and trimmed character vector, and those will be assigned to temporary variables. These will be loaded into the Players table after the second row is processed.
The remainder of the first row is the match data. Once trimmed, it always has the same format. It is always five characters long, it always starts with a letter, and it may end in a number or in blanks. The number is at most two digits, as there were only 64 contestants. Therefore, for the remaining entries in the vector, if the last two digits are extracted and converted to an integer, either the opponents player ID, or an NA if there was none, will be the result. These values will be appended to the Matches table, which is simply the list of pairs of players. Every match will appear twice, as the player in question is always first and the opponent is always second. This is handled by the following code:1
for (i in seq_len(length(CurLine) - 3L)) {
Matches <- rbind(Matches, sata.frame(P1 = CurID,
P2 = as.integer(substr(CurLine[i + 3], 4, 5))))
}
In synopsis, starting at entry 4 of the parsed first row and traversing to the end, identify the opponent—if there was any. Then immediately update the Matches table through an rbind.
In the second data row there is only player data to be extracted—specifically the player’s state and pre-tournament ranking. After processing the raw read line similarly to the first row, the first entry is the state which is assigned to a temporary variable.
The second entry of this line is itself a compound character string, but which has a consistent format. It starts with a nine-character string, a space, a solidus (/), another space, an R, a colon, a space, the pre-tournament ranking, some more spaces, an arrow, and then the post tournament ranking. Thus, splitting this entry using : as the delimiter, taking the second half, extracting the first four characters, trimming the white space, and converting to integer will return the value of the pre-tournament ranking. That sounds complicated, but the call isn’t:
as.integer(substr(unlist(strsplit(CurLine[2], split = ": ", fixed = TRUE))[2], 1, 4))
I wish markdown would make it easier to use color, but in the above call, from inside out (as all good programmers should)2, the second entry of the full data line is itself split at :, it’s second entry’s first 4 characters are taken and turned into an integer. This value is itself stored in a temporary variable.
Now with all the easily accessible player data stored in temporary variables, these are added to the Players table using rbind. This process repeats until readLines returns a length of 0, meaning the file is ended, and the loop is exited via break to mimic a do_until structure.
The file has now been traversed, the connection closed, and two tables have been populated: Players and Matches. For efficiency, the Players table is indexed by the player ID. The remaining step in the challenge is to calculate for each player 1, what the average ranking of corresponding the player 2’s are, remembering that some pairs may have NA for player 2, and append that to the Players table. This will take a few steps, which will be chained in the actual code for efficiency.
The first step is to join the PreRank from the Players table for each player 2 in the Matches table to the player 1 ID of the matches table. This is done via the data.table call
Players[Matches, on = "P_ID == P2", .(P1, PreRank)]
which can be read as:
Left join Players to Matches where P_ID in Matches is the same as P2 in Players and return P1 from Players and P2’s PreRanking from Matches.
In SQL, this would be something like:
SELECT M.P1, P.PreRank
FROM
Players as P
LEFT JOIN Matches as M
on P.P_ID = M.P2;
The next step would be to summarize that table to have one row for each player 1 and the average of the entries associated with it. If the two-column table just created were called X, then next call would be
X[, .(OppPreRank = as.integer(round(mean(PreRank, na.rm = TRUE), 0L))), keyby = P1]
which would be read as: > With X, group by P1 and take the average of all the PreRanks excluding NAs, round it, cast it to an integer, and then call that OppPreRank and then order it by P1.
You can figure out the SQL for that on your own :)
Since data.table allows chaining, the result of the first join does not have to be stored to a variable but can be piped directly to the second call which now explains most of the join:
Players[Matches, on = "P_ID == P2", .(P1, PreRank)][, .(OppPreRank = as.integer(round(
mean(PreRank, na.rm = TRUE), 0L))), keyby = P1]
In data.table syntax, chaining a close bracket to an open bracket like so ][ means use the result of the first call as the new table from which to process the second call.
The result of the chained call should be a table with 64 rows and two columns: one of player IDs and one of their average opponent pre-tournament ranking. What is left is to join this to the rest of the player data in the Players table. This time, an inner join will be used.3 If the result of our chained call was called X, this new call would be Player[X, nomatch = NULL]. Like last time, though, there is no need to explicitly store the intermediate results, and the complicated call above can be simply wrapped in Players[..., nomatch = NULL] which is what is in the code.
At this point, the requested challenge is in memory, and what remains is to write it to a CSV file, which can be done by write.csv, but as data.table is already loaded, the more efficient fwrite can be used.
library(data.table)
Players <- data.table(P_ID = integer(0), Name = character(0L),
State = character(0L), Points = double(0L),
PreRank = integer(0L))
Matches <- data.table(P1 = integer(0), P2 = integer(0))
TD <- file("./tournamentinfo.txt", open = "rt")
# Skip header lines
CurLine <- readLines(TD, 3L, warn = FALSE)
# Main Loop
repeat {
CurLine <- readLines(TD, 1L, warn = FALSE)
lenCL <- length(CurLine)
# Check for end-of-file
if (lenCL == 0L) {
break
}
# Check for new record
if (substr(CurLine, 1L, 1L) == "-") {
recordline <- 1L
} else {
if (recordline == 1L) {
CurLine <- trimws(unlist(strsplit(CurLine, "|", fixed = TRUE)),
which = "both")
CurID <- as.integer(CurLine[1])
CurName <- CurLine[2]
CurPoints <- as.double(CurLine[3])
for (i in seq_len(length(CurLine) - 3L)) {
Matches <- rbind(Matches,
data.table(P1 = CurID,
P2 = as.integer(substr(CurLine[i + 3], 4, 5))))
}
recordline <- 2L
} else {
CurLine <- trimws(unlist(strsplit(CurLine, "|", fixed = TRUE)),
which = "both")
CurState <- CurLine[1]
CurPre <- as.integer(substr(unlist(
strsplit(CurLine[2], split = ": ", fixed = TRUE))[2], 1, 4))
Players <- rbind(Players, data.table(P_ID = CurID, Name = CurName,
State = CurState, Points = CurPoints,
PreRank = CurPre))
}
}
}
close(TD)
setkey(Players, P_ID)
ChessExtract <- Players[
Players[
Matches, on = "P_ID == P2", .(P1, PreRank)
][, .(OppPreRank = as.integer(round(
mean(PreRank, na.rm = TRUE), 0L))), keyby = P1],
nomatch = NULL]
fwrite(ChessExtract, "./ChessExtract.csv")
With only 64 lines, the extract can be written to this document as well.
ChessExtract
## P_ID Name State Points PreRank OppPreRank
## 1: 1 GARY HUA ON 6.0 1794 1605
## 2: 2 DAKSHESH DARURI MI 6.0 1553 1469
## 3: 3 ADITYA BAJAJ MI 6.0 1384 1564
## 4: 4 PATRICK H SCHILLING MI 5.5 1716 1574
## 5: 5 HANSHI ZUO MI 5.5 1655 1501
## 6: 6 HANSEN SONG OH 5.0 1686 1519
## 7: 7 GARY DEE SWATHELL MI 5.0 1649 1372
## 8: 8 EZEKIEL HOUGHTON MI 5.0 1641 1468
## 9: 9 STEFANO LEE ON 5.0 1411 1523
## 10: 10 ANVIT RAO MI 5.0 1365 1554
## 11: 11 CAMERON WILLIAM MC LEMAN MI 4.5 1712 1468
## 12: 12 KENNETH J TACK MI 4.5 1663 1506
## 13: 13 TORRANCE HENRY JR MI 4.5 1666 1498
## 14: 14 BRADLEY SHAW MI 4.5 1610 1515
## 15: 15 ZACHARY JAMES HOUGHTON MI 4.5 1220 1484
## 16: 16 MIKE NIKITIN MI 4.0 1604 1386
## 17: 17 RONALD GRZEGORCZYK MI 4.0 1629 1499
## 18: 18 DAVID SUNDEEN MI 4.0 1600 1480
## 19: 19 DIPANKAR ROY MI 4.0 1564 1426
## 20: 20 JASON ZHENG MI 4.0 1595 1411
## 21: 21 DINH DANG BUI ON 4.0 1563 1470
## 22: 22 EUGENE L MCCLURE MI 4.0 1555 1300
## 23: 23 ALAN BUI ON 4.0 1363 1214
## 24: 24 MICHAEL R ALDRICH MI 4.0 1229 1357
## 25: 25 LOREN SCHWIEBERT MI 3.5 1745 1363
## 26: 26 MAX ZHU ON 3.5 1579 1507
## 27: 27 GAURAV GIDWANI MI 3.5 1552 1222
## 28: 28 SOFIA ADINA STANESCU-BELLU MI 3.5 1507 1522
## 29: 29 CHIEDOZIE OKORIE MI 3.5 1602 1314
## 30: 30 GEORGE AVERY JONES ON 3.5 1522 1144
## 31: 31 RISHI SHETTY MI 3.5 1494 1260
## 32: 32 JOSHUA PHILIP MATHEWS ON 3.5 1441 1379
## 33: 33 JADE GE MI 3.5 1449 1277
## 34: 34 MICHAEL JEFFERY THOMAS MI 3.5 1399 1375
## 35: 35 JOSHUA DAVID LEE MI 3.5 1438 1150
## 36: 36 SIDDHARTH JHA MI 3.5 1355 1388
## 37: 37 AMIYATOSH PWNANANDAM MI 3.5 980 1385
## 38: 38 BRIAN LIU MI 3.0 1423 1539
## 39: 39 JOEL R HENDON MI 3.0 1436 1430
## 40: 40 FOREST ZHANG MI 3.0 1348 1391
## 41: 41 KYLE WILLIAM MURPHY MI 3.0 1403 1248
## 42: 42 JARED GE MI 3.0 1332 1150
## 43: 43 ROBERT GLEN VASEY MI 3.0 1283 1107
## 44: 44 JUSTIN D SCHILLING MI 3.0 1199 1327
## 45: 45 DEREK YAN MI 3.0 1242 1152
## 46: 46 JACOB ALEXANDER LAVALLEY MI 3.0 377 1358
## 47: 47 ERIC WRIGHT MI 2.5 1362 1392
## 48: 48 DANIEL KHAIN MI 2.5 1382 1356
## 49: 49 MICHAEL J MARTIN MI 2.5 1291 1286
## 50: 50 SHIVAM JHA MI 2.5 1056 1296
## 51: 51 TEJAS AYYAGARI MI 2.5 1011 1356
## 52: 52 ETHAN GUO MI 2.5 935 1495
## 53: 53 JOSE C YBARRA MI 2.0 1393 1345
## 54: 54 LARRY HODGE MI 2.0 1270 1206
## 55: 55 ALEX KONG MI 2.0 1186 1406
## 56: 56 MARISA RICCI MI 2.0 1153 1414
## 57: 57 MICHAEL LU MI 2.0 1092 1363
## 58: 58 VIRAJ MOHILE MI 2.0 917 1391
## 59: 59 SEAN M MC CORMICK MI 2.0 853 1319
## 60: 60 JULIA SHEN MI 1.5 967 1330
## 61: 61 JEZZEL FARKAS ON 1.5 955 1327
## 62: 62 ASHWIN BALAJI MI 1.0 1530 1186
## 63: 63 THOMAS JOSEPH HOSMER MI 1.0 1175 1350
## 64: 64 BEN LI MI 1.0 1163 1263
## P_ID Name State Points PreRank OppPreRank
Once the ETL process is complete, the analysis can begin. Some questions which may now be addressed include:
ChessExtract[which.min(ChessExtract$PreRank)]
## P_ID Name State Points PreRank OppPreRank
## 1: 46 JACOB ALEXANDER LAVALLEY MI 3 377 1358
ChessExtract[which.min(ChessExtract$OppPreRank)]
## P_ID Name State Points PreRank OppPreRank
## 1: 43 ROBERT GLEN VASEY MI 3 1283 1107
ChessExtract[which.max(ChessExtract$PreRank)]
## P_ID Name State Points PreRank OppPreRank
## 1: 1 GARY HUA ON 6 1794 1605
ChessExtract[which.max(ChessExtract$OppPreRank)]
## P_ID Name State Points PreRank OppPreRank
## 1: 1 GARY HUA ON 6 1794 1605
cor(ChessExtract$PreRank, ChessExtract$OppPreRank)
## [1] 0.2839375
library(ggplot2)
ggplot(ChessExtract, aes(x = PreRank, y = OppPreRank)) + geom_point() + geom_smooth(method = "lm") + geom_smooth(method = "loess", color = 'red')
sessionInfo()
## R version 3.6.1 Patched (2019-08-04 r76915)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.2.1 data.table_1.12.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 knitr_1.24 magrittr_1.5 tidyselect_0.2.5
## [5] munsell_0.5.0 colorspace_1.4-1 R6_2.4.0 rlang_0.4.0
## [9] stringr_1.4.0 dplyr_0.8.3 tools_3.6.1 grid_3.6.1
## [13] gtable_0.3.0 xfun_0.9 withr_2.1.2 htmltools_0.3.6
## [17] assertthat_0.2.1 yaml_2.2.0 lazyeval_0.2.2 digest_0.6.20
## [21] tibble_2.1.3 crayon_1.3.4 purrr_0.3.2 glue_1.3.1
## [25] evaluate_0.14 rmarkdown_1.15 labeling_0.3 stringi_1.4.3
## [29] compiler_3.6.1 pillar_1.4.2 scales_1.0.0 pkgconfig_2.0.2
See Patrick Burn’s R Inferno (pdf) for why the seq_len(x) or seq_along(x) commands should always be used for looping and not the 1:length(x) formulation.↩
I’m sorry, I cannot help myself.↩
While a left join would be safe, as the two tables are known to have the same number of rows and the same index, an inner join is safer here.↩