Excerpt from source file
Our purpose is to take the source file tournamentinfo.txt and generate a .CSV file that has the following information for each player: Player's Name, Player's State, Total Number of Points, Players' Pre-Rating, and Average Pre Chess Rating of Opponents.
We calculate the last column by looking up the pre rating of each opponent by the pair number. So Gary Hua’s first opponent in Round 1 was pair number 39, Joel R Hendon, who had a pre chess rating of 1436P23, but we only care about 1436. We add up the rating for all the opponents, then divide by seven to arrive at the Average Pre Chess Rating of Opponents.
Let’s start by reading in the file.
library(stringr)
url <- "https://raw.githubusercontent.com/EyeDen/data607/master/tournamentinfo.txt"
download.file(url, destfile = "tournamentinfo.txt")
tour.info <- read.csv("tournamentinfo.txt", header = FALSE, sep = "|")
head(tour.info, 10)
## V1
## 1 -----------------------------------------------------------------------------------------
## 2 Pair
## 3 Num
## 4 -----------------------------------------------------------------------------------------
## 5 1
## 6 ON
## 7 -----------------------------------------------------------------------------------------
## 8 2
## 9 MI
## 10 -----------------------------------------------------------------------------------------
## V2 V3 V4 V5 V6 V7 V8
## 1
## 2 Player Name Total Round Round Round Round Round
## 3 USCF ID / Rtg (Pre->Post) Pts 1 2 3 4 5
## 4
## 5 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7
## 6 15445895 / R: 1794 ->1817 N:2 W B W B W
## 7
## 8 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16
## 9 14598900 / R: 1553 ->1663 N:2 B W B W B
## 10
## V9 V10 V11
## 1 NA
## 2 Round Round NA
## 3 6 7 NA
## 4 NA
## 5 D 12 D 4 NA
## 6 B W NA
## 7 NA
## 8 W 20 W 7 NA
## 9 W B NA
## 10 NA
Not bad, but it’s not exactly structured the way we’d like. Firstly, the source file actually contains two headers. This means that every entry is really a pair. However, if we examine each entry’s second row, we will see that we don’t need all the information contained. Secondly, the “——” lines might make the text file more readable, but turns our data frame into a mess. Lastly, the final column is nothing but NA values, a quirk of the original file.
Removing the extraneous characters is simple to do. By examining the original text file in a text editor, we can see that it has 196 lines and that the last line is also a meaningless row of “——”. Alternatively, if the text file is too unwieldly to read, we can peak at it with this:
tail(tour.info, 10)
## V1
## 187 -----------------------------------------------------------------------------------------
## 188 62
## 189 MI
## 190 -----------------------------------------------------------------------------------------
## 191 63
## 192 MI
## 193 -----------------------------------------------------------------------------------------
## 194 64
## 195 MI
## 196 -----------------------------------------------------------------------------------------
## V2 V3 V4 V5 V6 V7 V8
## 187
## 188 ASHWIN BALAJI 1.0 W 55 U U U U
## 189 15219542 / R: 1530 ->1535 B
## 190
## 191 THOMAS JOSEPH HOSMER 1.0 L 2 L 48 D 49 L 43 L 45
## 192 15057092 / R: 1175 ->1125 W B W B B
## 193
## 194 BEN LI 1.0 L 22 D 30 L 31 D 49 L 46
## 195 15006561 / R: 1163 ->1112 B W W B W
## 196
## V9 V10 V11
## 187 NA
## 188 U U NA
## 189 NA
## 190 NA
## 191 H U NA
## 192 NA
## 193 NA
## 194 L 42 L 54 NA
## 195 B B NA
## 196 NA
Again, row 196 is the last row and is nothing but the character “——”. Because each entry is a pair, and each pair of entries is separated by dashes, there is an obvious pattern we can exploit. Each dash line occurs three lines after the previous. Let us remove them this way:
rows.to.delete <- seq(from = 1, to = 196, by = 3)
tour.info <- tour.info[-rows.to.delete, ]
head(tour.info, 10)
## V1 V2 V3 V4 V5 V6 V7
## 2 Pair Player Name Total Round Round Round Round
## 3 Num USCF ID / Rtg (Pre->Post) Pts 1 2 3 4
## 5 1 GARY HUA 6.0 W 39 W 21 W 18 W 14
## 6 ON 15445895 / R: 1794 ->1817 N:2 W B W B
## 8 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17
## 9 MI 14598900 / R: 1553 ->1663 N:2 B W B W
## 11 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21
## 12 MI 14959604 / R: 1384 ->1640 N:2 W B W B
## 14 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26
## 15 MI 12616049 / R: 1716 ->1744 N:2 W B W B
## V8 V9 V10 V11
## 2 Round Round Round NA
## 3 5 6 7 NA
## 5 W 7 D 12 D 4 NA
## 6 W B W NA
## 8 W 16 W 20 W 7 NA
## 9 B W B NA
## 11 W 11 W 13 W 12 NA
## 12 W B W NA
## 14 D 5 W 19 D 1 NA
## 15 W B B NA
Much better! Now let’s work on extracting only the information we need from the second row of each paired row. We only need Player's State from the first column, and their Pre-Rating Score from the second. For that, we’ll need to use regular expressions.
First, we will clean up the row numbers to better see the pattern we’ll need for the paired rows.
row.names(tour.info) <- 1:nrow(tour.info)
head(tour.info)
## V1 V2 V3 V4 V5 V6 V7
## 1 Pair Player Name Total Round Round Round Round
## 2 Num USCF ID / Rtg (Pre->Post) Pts 1 2 3 4
## 3 1 GARY HUA 6.0 W 39 W 21 W 18 W 14
## 4 ON 15445895 / R: 1794 ->1817 N:2 W B W B
## 5 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17
## 6 MI 14598900 / R: 1553 ->1663 N:2 B W B W
## V8 V9 V10 V11
## 1 Round Round Round NA
## 2 5 6 7 NA
## 3 W 7 D 12 D 4 NA
## 4 W B W NA
## 5 W 16 W 20 W 7 NA
## 6 B W B NA
tail(tour.info)
## V1 V2 V3 V4 V5 V6 V7
## 125 62 ASHWIN BALAJI 1.0 W 55 U U U
## 126 MI 15219542 / R: 1530 ->1535 B
## 127 63 THOMAS JOSEPH HOSMER 1.0 L 2 L 48 D 49 L 43
## 128 MI 15057092 / R: 1175 ->1125 W B W B
## 129 64 BEN LI 1.0 L 22 D 30 L 31 D 49
## 130 MI 15006561 / R: 1163 ->1112 B W W B
## V8 V9 V10 V11
## 125 U U U NA
## 126 NA
## 127 L 45 H U NA
## 128 B NA
## 129 L 46 L 42 L 54 NA
## 130 W B B NA
The only rows we’re concerned about right now starts at row 4 for Gary Hua, ends at row 130 for Ben Li, and are separated by one row. Our sequence will look like this.
rows.to.extract <- seq(from = 4, to = 130, by = 2)
We’ll start with the states, because it’s simpler.
states <- as.character(tour.info[rows.to.extract, 1])
states <- str_replace_all(states, pattern = " ", replacement = "")
states
## [1] "ON" "MI" "MI" "MI" "MI" "OH" "MI" "MI" "ON" "MI" "MI" "MI" "MI" "MI"
## [15] "MI" "MI" "MI" "MI" "MI" "MI" "ON" "MI" "ON" "MI" "MI" "ON" "MI" "MI"
## [29] "MI" "ON" "MI" "ON" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI"
## [43] "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI" "MI"
## [57] "MI" "MI" "MI" "MI" "ON" "MI" "MI" "MI"
Now let’s get the scores. The obvious delimiters should be the R: that precedes the pre-score, and the -> that separates the pre and post score. However, we know from our very first example that some scores have characters in them such as 1436P23. Also, some scores do not break a thousand and are padded with two spaces instead of one. We can simplify the regex like so: "R:\\s+\\d+.
prescores <- str_extract(tour.info[rows.to.extract, 2], pattern = "R:\\s+\\d+")
prescores <- as.integer(str_replace_all(prescores, pattern = "R:\\s+", replacement = ""))
prescores
## [1] 1794 1553 1384 1716 1655 1686 1649 1641 1411 1365 1712 1663 1666 1610
## [15] 1220 1604 1629 1600 1564 1595 1563 1555 1363 1229 1745 1579 1552 1507
## [29] 1602 1522 1494 1441 1449 1399 1438 1355 980 1423 1436 1348 1403 1332
## [43] 1283 1199 1242 377 1362 1382 1291 1056 1011 935 1393 1270 1186 1153
## [57] 1092 917 853 967 955 1530 1175 1163
Next we need to convert the opponents into integers. For this assignment we don’t care whether a player won, lost, had a bye, or any other stat. We only care about who their opponent was, if they had one at all. To make this even easier, we can now delete the rows we no longer need.
tour.info <- tour.info[-c(1, 2, rows.to.extract), ]
tour.info$V12 <- prescores
op.cols <- c(4, 5, 6, 7, 8, 9, 10)
for(i in op.cols){
tour.info[, i] <- as.integer(str_extract(tour.info[, i], pattern = "\\s+\\d*"))
}
tour.info[is.na(tour.info)] <- 0
row.names(tour.info) <- 1:nrow(tour.info)
colnames(tour.info) <- c("PairID", "Player.Name", "TotalPts", "R1", "R2", "R3", "R4", "R5", "R6", "R7", "Player.State", "PreChess")
Finally we’ve got a clean data frame. All that’s left is to calculate the Average Pre Chess Rating of Opponents.
tour.info$Avg.Opp.Rating <- ""
avg.rating <- function(x, columns){
# Get playerIDs
for(i in 1:64){
opponents <- x[i, columns]
# Get Pre Chess rating for ID (if applicable)
num.op <- 0
sum <- 0
for(j in opponents){
if(j != 0){
sum <- sum + x[j, 12]
num.op <- num.op + 1
}
}
# Get Average
avg <- sum/num.op
# Insert into column
x[i, 13] <- avg
}
return(x)
}
tour.info <- avg.rating(tour.info, op.cols)
head(tour.info, 10)
## PairID Player.Name TotalPts R1 R2 R3 R4 R5 R6 R7
## 1 1 GARY HUA 6.0 39 21 18 14 7 12 4
## 2 2 DAKSHESH DARURI 6.0 63 58 4 17 16 20 7
## 3 3 ADITYA BAJAJ 6.0 8 61 25 21 11 13 12
## 4 4 PATRICK H SCHILLING 5.5 23 28 2 26 5 19 1
## 5 5 HANSHI ZUO 5.5 45 37 12 13 4 14 17
## 6 6 HANSEN SONG 5.0 34 29 11 35 10 27 21
## 7 7 GARY DEE SWATHELL 5.0 57 46 13 11 1 9 2
## 8 8 EZEKIEL HOUGHTON 5.0 3 32 14 9 47 28 19
## 9 9 STEFANO LEE 5.0 25 18 59 8 26 7 20
## 10 10 ANVIT RAO 5.0 16 19 55 31 6 25 18
## Player.State PreChess Avg.Opp.Rating
## 1 0 1794 1605.28571428571
## 2 0 1553 1469.28571428571
## 3 0 1384 1563.57142857143
## 4 0 1716 1573.57142857143
## 5 0 1655 1500.85714285714
## 6 0 1686 1518.71428571429
## 7 0 1649 1372.14285714286
## 8 0 1641 1468.42857142857
## 9 0 1411 1523.14285714286
## 10 0 1365 1554.14285714286
Almost there. We just need to round the numbers, as indicated in the prompt.
tour.info$Avg.Opp.Rating <- round(as.integer(tour.info[,13]))
Now we write the required information to a .CSV.
final <- subset(tour.info, select = c("Player.Name", "Player.State", "TotalPts", "PreChess", "Avg.Opp.Rating"))
head(final, 10)
## Player.Name Player.State TotalPts PreChess
## 1 GARY HUA 0 6.0 1794
## 2 DAKSHESH DARURI 0 6.0 1553
## 3 ADITYA BAJAJ 0 6.0 1384
## 4 PATRICK H SCHILLING 0 5.5 1716
## 5 HANSHI ZUO 0 5.5 1655
## 6 HANSEN SONG 0 5.0 1686
## 7 GARY DEE SWATHELL 0 5.0 1649
## 8 EZEKIEL HOUGHTON 0 5.0 1641
## 9 STEFANO LEE 0 5.0 1411
## 10 ANVIT RAO 0 5.0 1365
## Avg.Opp.Rating
## 1 1605
## 2 1469
## 3 1563
## 4 1573
## 5 1500
## 6 1518
## 7 1372
## 8 1468
## 9 1523
## 10 1554
write.csv(final, file = "chessinfo.csv")