In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents
A sample of the .txt format is provided below.
# ----------------------------------------------------------------------------------------- #
# 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| #
# ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W | #
# ----------------------------------------------------------------------------------------- #
First, we read in the data. The .txt file is captured as a single character string.
txt <- read_file(paste0('https://raw.githubusercontent.com/kac624/',
'cuny/main/D607/data/tournamentinfo.txt'))
Next, we perform set-up before parsing the .txt file. This includes (i) removing line breaks from the character string, (ii) capturing the locations of long strings of dashes (“-”) that separate entries for different players, and (iii) setting up an empty dataframe that will hold our target variables.
txt2 <- str_replace_all(txt, c('\n'=''))
locations <- str_locate_all(txt2, paste(replicate(89, "-"),
collapse = "")
)[[1]]
df <- data.frame(matrix(ncol = 11, nrow = 0))
x <- c('name', 'state', 'total_pts', 'pre_rating', 'oppnt1',
'oppnt2', 'oppnt3', 'oppnt4', 'oppnt5', 'oppnt6', 'oppnt7')
colnames(df) <- x
The following loop parses the data using the str_match function and regular expressions. The loop iterates on every “row” in the .txt file (based on previously identified locations), skipping the first row (which is a header).
Finally, all these variables are moved into the previously created dataframe.
for (i in 2:(nrow(locations) - 1)) {
row <- str_sub(txt2, locations[i,2] + 1, locations[i+1,1] - 1)
name <- trimws(str_match(row, '[0-9] \\| (.*?) \\|[0-9]')[,2])
state <- str_match(row, ' ([A-Z]{2}) \\| [0-9]')[,2]
total_pts <- str_match(row, '\\|([0-9]\\.[0-9]) \\|')[,2]
pre_rating <- str_match(row, 'R:( | )([0-9]+|[0-9]+P[0-9])')[,3]
for (j in 1:7) {
assign(paste0('oppnt', j),
str_match(row, paste0('(\\|[A-Z]\\s+( |[0-9]+)){',j,'}'))[,3])
}
df[i-1, ] <- c(name, state, total_pts, pre_rating, oppnt1,
oppnt2, oppnt3, oppnt4, oppnt5, oppnt6, oppnt7)
}
We still have one final variable to capture: Average Pre-Rating of Opponents. First, we convert the opponent player IDs to integers, so they can be used for indexing. Next, we loop through each row to calculate the average of all opponent player ratings, using the following steps.
for (i in colnames(df)[4:11]) {
df[,i] <- as.integer(df[,i])
}
for (i in 1:nrow(df)) {
oppnts <- df[i,5:11]
ratings <- c()
for (j in 1:length(oppnts)) {
if (!is.na(oppnts[[j]])) {
ratings <- c(ratings, df[oppnts[[j]], 'pre_rating'])
}
}
avg_oppnt_rtg <- mean(ratings)
df[i, 'avg_oppnt_rtg'] <- avg_oppnt_rtg
}
The resulting dataframe is previewed below, and exported to .CSV.
head(df)
## name state total_pts pre_rating oppnt1 oppnt2 oppnt3 oppnt4
## 1 GARY HUA ON 6.0 1794 39 21 18 14
## 2 DAKSHESH DARURI MI 6.0 1553 63 58 4 17
## 3 ADITYA BAJAJ MI 6.0 1384 8 61 25 21
## 4 PATRICK H SCHILLING MI 5.5 1716 23 28 2 26
## 5 HANSHI ZUO MI 5.5 1655 45 37 12 13
## 6 HANSEN SONG OH 5.0 1686 34 29 11 35
## oppnt5 oppnt6 oppnt7 avg_oppnt_rtg
## 1 7 12 4 1605.286
## 2 16 20 7 1469.286
## 3 11 13 12 1563.571
## 4 5 19 1 1573.571
## 5 4 14 17 1500.857
## 6 10 27 21 1518.714
write_csv(df, 'output/proj1_chessRatings.csv')
I’ve constructed four plots below to assess the relationship between players’ ratings and their performance. Performance is measured in all cases by ‘Total Points’ they obtained during the tournament. The plots compare Total Points to (i) pre-rating, (ii) the difference between pre-rating and average opponent pre-rating, (iii) the sum of differences between pre-rating and each opponent’s pre-rating, and (iv) the number of matches in which they had a higher pre-rating than their opponent.
Unfortunately, none of these views provide evidence of a very clear relationship. All tend to point to a positive relationship between pre-rating and performance, as one might expect. But there appears to be a lot of noise in each plot, indicating that these relationships are not very strong.
ggplot(df, aes(x = pre_rating, y = total_pts)) +
geom_point()
df <- df %>%
mutate(avg_rating_diff = pre_rating - avg_oppnt_rtg)
ggplot(df, aes(x = avg_rating_diff, y = total_pts)) +
geom_point()
for (i in 1:nrow(df)) {
oppnts <- df[i,5:11]
ratings <- c()
for (j in 1:length(oppnts)) {
if (!is.na(oppnts[[j]])) {
ratings <- c(ratings, df[oppnts[[j]], 'pre_rating'])
}
}
ratings_compare <- df[i,'pre_rating'] - ratings
df[i, 'ratings_advantage'] <- sum(ratings_compare)
df[i, 'advantaged_matches'] <- length(ratings_compare[ratings_compare>0])
}
ggplot(df, aes(x = ratings_advantage, y = total_pts)) +
geom_point()
ggplot(df, aes(x = advantaged_matches, y = total_pts)) +
geom_point()