This project calls for parsing a raw text file composed of chess tournament scores. The specs call for retrieving each player’s name, state, total points and pre-tournament rating, then calculating the average pre-tournament rating of each player’s opponents. The refined data are to be exported to a .csv file for use elsewhere.
We will use six R packages. Most of the work involves regular expressions in conjunction with the str_extract_all() function from stringr. For fun, we’ll plot some of the cleaned-up data using ggplot2.
Here is the list of packages followed by a peek at the original fixed-width text file.
library(knitr)
library(stringr)
library(ggplot2)
library(ggthemes)
library(scales)
library(pander)
panderOptions("digits", 0)
panderOptions("table.style", "rmarkdown")
panderOptions("table.alignment.default", "left")
Each player’s information is organized in tabular fashion and split across multiple lines. Our strategy is to split the data vertically into two parts by creating an index, then parsing each portion to get the information we need.
input <- readLines("tournamentinfo.txt")
head(input, 12)
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|"
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |"
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [7] "-----------------------------------------------------------------------------------------"
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [10] "-----------------------------------------------------------------------------------------"
## [11] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [12] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
# create an index to separate the first and second player lines
idx.player <- seq.int(5, 196, by = 3 )
idx.state <- seq.int(6, 195, by = 3)
# separate the two parts
player <- input[idx.player]
states <- input[idx.state]
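The bounds 196 and 195 above are specific to this file. As a minimal sketch, the same split can derive its indices from the length of the input. The vector `toy` and the names `first.line` and `second.line` are made up for illustration and stand in for `input` and the two index vectors.

```r
# Toy stand-in for readLines("tournamentinfo.txt"): a 4-line header,
# then repeating blocks of player line, state line, separator.
toy <- c("----", "header 1", "header 2", "----",
         "1 | GARY HUA", "ON | 15445895", "----",
         "2 | DAKSHESH DARURI", "MI | 14598900", "----")

# Every third line starting at 5 is a player line; starting at 6, a state line.
first.line  <- seq.int(5, length(toy), by = 3)
second.line <- seq.int(6, length(toy), by = 3)

toy.player <- toy[first.line]
toy.states <- toy[second.line]
```

This assumes the layout never varies: exactly four header lines and three lines per player.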
We’ve created two text objects, “player” and “states”, each containing some of the data we want. This is a job for regular expressions and str_extract_all(). We put what we want into temporary storage so we can reunite it later in an R data frame. It took some fiddling to get the proper regular expression for capturing the names.
# get the pair number, total points and opponent IDs from the player object
id <- "(\\d{1,2}\\.?\\d*)"
ids <- str_extract_all(player, id)
# get the names from player
name <- "([:alpha:]{2,})(([:blank:][[:alpha:]-?]{1,})?)+"
names <- str_extract_all(player, name)
# get the pre-rating from states
prerate <- ":\\s+\\d{3,4}"
rates <- str_extract_all(states, prerate)
rates <- str_extract_all(rates, "\\d+")
# get the state names from states
state <- "^\\s+[:alpha:]{2}"
state.names <- str_trim(str_extract_all(states, state))
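To see what each pattern picks up, here is a small sketch run on two sample lines copied from the head(input, 12) output above (with whitespace as displayed there). The patterns are close variants of the ones above, not the originals: they use the bracketed `[[:alpha:]]` form and `str_extract()` for single matches, and the variable names are illustrative.

```r
library(stringr)

# Two sample lines as displayed in the head(input, 12) output
player.line <- " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
state.line  <- " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"

# Pair number, total points, then opponent numbers, in order of appearance
nums <- str_extract_all(player.line, "\\d{1,2}\\.?\\d*")[[1]]

# The first run of two or more letters, plus any following words, is the name
nm <- str_extract(player.line, "[[:alpha:]]{2,}([[:blank:]][[:alpha:]\\-]+)*")

# Pre-rating: the digits after "R:"; state: two letters at the start of the line
pre <- as.numeric(str_extract(str_extract(state.line, ":\\s+\\d{3,4}"), "\\d+"))
st  <- str_trim(str_extract(state.line, "^\\s+[[:alpha:]]{2}"))
```

Note that the numeric pattern returns the pair number and total points ahead of the opponent numbers, which is why those first two values are treated separately later on.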
Here’s a peek at the extracted material.
ON, MI, MI and MI
We can now put the filtered material into an R dataframe for export and to compute the average opponent ratings. This step also involves transforming data types. Some players recorded fewer than seven games, so we have to fill NA in those columns.
# first the simple lists
chessdb <- data.frame(cbind(names, state.names, rates))
chessdb$state.names <- as.factor(unlist(chessdb$state.names))
chessdb$names <- as.factor(unlist(chessdb$names))
chessdb$rates <- as.numeric(unlist(chessdb$rates))
# include NA values; StackOverflow helped with some of this code
n.obs <- sapply(ids, length)
seq.max <- seq_len(max(n.obs))
id.vars <- as.data.frame(t(sapply(ids, "[", i = seq.max)), stringsAsFactors = FALSE)
id.vars[, c(1:9)] <- lapply(id.vars[, c(1:9)], as.numeric)
# put them together, deleting an unneeded ID column
chessdb2 <- cbind(chessdb, id.vars[, -1])
colnames(chessdb2)[1:4] <- c("Name", "State", "Pre.Rating", "Total.Pts")
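The NA padding relies on the fact that indexing past the end of an R vector returns NA. A minimal sketch on a made-up ragged list (`toy.ids`, `width`, `padded` and `toy.df` are hypothetical names, not part of the original code):

```r
# Hypothetical ragged input: the second player has one fewer match
toy.ids <- list(c("1", "6.0", "39", "21"), c("2", "5.5", "63"))

# Indexing past the end of a vector yields NA, so "[" pads the short rows
width  <- seq_len(max(sapply(toy.ids, length)))
padded <- t(sapply(toy.ids, "[", i = width))

# Convert the character matrix to a numeric data frame
toy.df <- as.data.frame(padded, stringsAsFactors = FALSE)
toy.df[] <- lapply(toy.df, as.numeric)
```

Because short rows are padded on the right, the remaining opponent numbers no longer line up with their round. That is harmless here, since only the set of opponents matters for the average.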
Here’s a view of the data frame in R. I’m following Google’s style for R variable names (though I honestly don’t care for the dot). The variables V3-V9 will be used in the next step to index player pre-ratings and compute the opponent average.
## Name State Pre.Rating Total.Pts V3 V4 V5 V6 V7 V8 V9
## 1 GARY HUA ON 1794 6.0 39 21 18 14 7 12 4
## 2 DAKSHESH DARURI MI 1553 6.0 63 58 4 17 16 20 7
## 3 ADITYA BAJAJ MI 1384 6.0 8 61 25 21 11 13 12
## 4 PATRICK H SCHILLING MI 1716 5.5 23 28 2 26 5 19 1
This step required writing a bespoke function to look up the pre-match rating for each contestant’s opponents, then calculate their average rating. The apply family of functions didn’t seem to work no matter what I tried.
get.score2 <- function(db){
  # computes the average pre-tournament rating of each contestant's opponents
  #
  # Args:
  #   db: a data frame, in this case our bespoke df derived from tournamentinfo.txt
  #
  # Note: Not applicable generally; this is crafted just for this purpose
  # Returns: A vector of average opponent ratings, rounded to whole numbers
  avgs <- NULL
  for (i in 1:nrow(db)){
    num <- 0
    denom <- 0
    for (j in 5:11){   # opponent ID columns V3-V9
      opponent <- as.numeric(db[i, j])
      if (is.na(opponent)){   # skip rounds with no recorded opponent
        next
      }
      rating <- db[opponent, 3]   # column 3 holds Pre.Rating
      num <- num + rating
      denom <- denom + 1
    }
    avg <- round(num / denom, 0)
    avgs <- append(avgs, avg)
  }
  return(avgs)
}
# append the result to our working data frame
chessdb2$Opp.Avg <- get.score2(chessdb2)
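For comparison, the same lookup can be sketched without explicit loops by indexing the rating vector with the whole opponent matrix. This is an alternative under stated assumptions, not the method used above: `toy.pre` and `toy.opp` are made-up data, and pair numbers are assumed to equal row positions, which get.score2() also assumes.

```r
# Hypothetical toy data: four players' pre-ratings, indexed by pair number
toy.pre <- c(1500, 1600, 1700, 1800)

# Opponent pair numbers, one row per player; NA marks an unplayed round
toy.opp <- rbind(c(2, 3, NA),
                 c(1, 4, 3),
                 c(1, 2, 4),
                 c(2, 3, NA))

# Indexing the rating vector by the whole matrix looks up every opponent
# at once; NA indices come back as NA
opp.rating <- matrix(toy.pre[toy.opp], nrow = nrow(toy.opp))

# Row means, ignoring unplayed rounds, rounded as in get.score2()
toy.avg <- round(rowMeans(opp.rating, na.rm = TRUE), 0)
```

On the real data the inputs would presumably be `chessdb2$Pre.Rating` and `as.matrix(chessdb2[, 5:11])`.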
The data is ready for export to csv.
## Name State Pre.Rating Total.Pts Opp.Avg
## 1 GARY HUA ON 1794 6.0 1605
## 2 DAKSHESH DARURI MI 1553 6.0 1469
## 3 ADITYA BAJAJ MI 1384 6.0 1564
## 4 PATRICK H SCHILLING MI 1716 5.5 1574
## 5 HANSHI ZUO MI 1655 5.5 1501
## 6 HANSEN SONG OH 1686 5.0 1519
The most trivial step in the process.
# export our csv file
write.csv(chessdb2[, c(1:4,12)], "chess.csv")
list.files()[1]
## [1] "chess.csv"
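By default write.csv() also writes the row names as an unnamed first column. If that column is not wanted downstream, a sketch like the following drops it. It uses the first two rows of the table above and a temporary file; `toy.out`, `out.file` and `check` are illustrative names.

```r
# First two rows of the export table shown above
toy.out <- data.frame(Name       = c("GARY HUA", "DAKSHESH DARURI"),
                      State      = c("ON", "MI"),
                      Pre.Rating = c(1794, 1553),
                      Total.Pts  = c(6.0, 6.0),
                      Opp.Avg    = c(1605, 1469))

# Write to a temporary file without the row-name column, then read it back
out.file <- tempfile(fileext = ".csv")
write.csv(toy.out, out.file, row.names = FALSE)
check <- read.csv(out.file, stringsAsFactors = FALSE)
```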
Players’ pre-contest ratings have a mild positive correlation (0.284) with opponents’ average rating.
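A correlation like this is a single call to cor(). The sketch below uses only the six rows displayed in the table above, so its result will not match the full-data figure; the vector names are illustrative.

```r
# Pre.Rating and Opp.Avg for the six players shown in the table above
pre.rating <- c(1794, 1553, 1384, 1716, 1655, 1686)
opp.avg    <- c(1605, 1469, 1564, 1574, 1501, 1519)

# Pearson correlation on this six-row sample only;
# on the full data: cor(chessdb2$Pre.Rating, chessdb2$Opp.Avg)
r.sample <- cor(pre.rating, opp.avg)
```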
# for fun, try a plot
# Tufte theme with the x-axis title nudged toward the left
mytufte <- theme_tufte() +
  theme(axis.title.x = element_text(hjust = 0.07))
ggplot(chessdb2, aes(x = Pre.Rating, y = Opp.Avg)) +
  geom_point(aes(color = State), size = 2) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(labels = scales::comma) +
  geom_smooth(method = 'lm', level = .90) +
  xlab("Pre-match Rating") + ylab("Opponent Mean Rating") +
  mytufte
This regular expression guide by Gaston Sanchez was very instructive.
So was RegExr.com.
And this post on StackOverflow.
Finally, Google’s R style guide.