In this project, I was given a text file of chess tournament results laid out in a loosely structured, pipe-delimited format. The task is to create an R Markdown file that generates a CSV file with the following information for every player:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents
For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605
1605 was calculated by summing the opponents’ pre-tournament ratings of 1436, 1563, 1600, 1610, 1649, 1663, and 1716, then dividing by the total number of games played (7).
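As a quick check of the arithmetic above, the average can be reproduced directly in R. This is only a sketch; the variable names (`opponent_ratings`, `average_rating`, `rounded_average`) are made up for illustration.

```r
# Pre-tournament ratings of Gary Hua's seven opponents (from the example above)
opponent_ratings <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

# Sum the ratings and divide by the number of games played
average_rating <- sum(opponent_ratings) / length(opponent_ratings)

# Round to the nearest whole number for reporting
rounded_average <- round(average_rating)
print(rounded_average)
## [1] 1605
```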
Before I explain my whole approach to this task I will do a bit of preliminary work to set up my environment. I will load the libraries I will be using:
library(readr)
library(stringr)
library(dplyr)
Next I will read in the chess data using readr’s read_lines function which loads each line into a character vector.
lines <- read_lines("https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/tournamentinfo.txt")
lines[1:7]
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [7] "-----------------------------------------------------------------------------------------"
I will loop through the character vector and check whether each line is made up of dashes. After reading two dashed lines I will start extracting data. Each player occupies two lines of useful data. The first has the player’s id, name, total points and match data. The second has the player’s state and pre-tournament rating.
I will extract the player data into a list and then load that into a dataframe. The match data will be a set of opponent ids that I will join to the players’ pre-tournament ratings to compute the average. Once I have these averages I will merge them back into the player data.
The name data is in all capital letters. We will need a function that lowercases each word and capitalizes its first letter.
capitalize_first_letter <- function(x) {
  x <- tolower(x)
  substr(x, 1, 1) <- toupper(substr(x, 1, 1))
  x
}
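A short usage sketch may help show how the helper behaves on a vector of name parts. The function is repeated so the snippet runs on its own; `name_parts` and `full_name` are illustrative names.

```r
# Same helper as above, repeated so this snippet is self-contained
capitalize_first_letter <- function(x) {
  x <- tolower(x)
  substr(x, 1, 1) <- toupper(substr(x, 1, 1))
  x
}

# The helper is vectorized, so each word is handled independently
name_parts <- c("GARY", "HUA")
full_name <- paste(capitalize_first_letter(name_parts), collapse = " ")
print(full_name)
## [1] "Gary Hua"
```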
Now we are ready to process the file. First we will initialize some variables.
# Initialize counters
dash_lines_found <- 0
data_line_count <- 0
# Initialize the extract data flag
extract_data <- FALSE
# Initialize a vector that will hold the match data
matches <- c()
Now for the heart of the project: the for loop.
# Loop through line by line
for (line in lines){
  # Look for the dash line
  if (line == "-----------------------------------------------------------------------------------------"){
    # Increment the dash line count
    dash_lines_found <- dash_lines_found + 1
    # Check if we need to reset the data line counter
    if (extract_data){
      data_line_count <- 0
    }
    # Check if we need to extract data
    if (dash_lines_found == 2){
      extract_data <- TRUE
    }
  } else if (extract_data) {
    # Increase the data line count by one
    data_line_count <- data_line_count + 1
    # Split the data by the pipe character
    split_line <- unlist(strsplit(line, "\\|"))
    # Extract the data
    if (data_line_count == 1){
      # This is the first line so extract the player's id
      player_id <- str_extract(split_line[1], "[[:digit:]]+")
      # Extract name, split name into parts, capitalize the first letter and
      # put it back together
      player_name <- str_extract_all(split_line[2], "[[:alpha:]]+")
      player_name <- unlist(player_name)
      player_name <- capitalize_first_letter(player_name)
      player_name <- paste(player_name, collapse = " ")
      # Extract total number of points
      player_total_points <- as.numeric(trimws(split_line[3]))
      # Loop through the match results
      for (game in split_line[4:10]){
        # Extract if it is a win, loss, draw, etc.
        result <- str_extract(game, "[[:alpha:]]")
        # Extract the opponent's id
        opponent_id <- str_extract(game, "[[:digit:]]+")
        # Add the data to the match vector
        matches <- rbind(matches, c(player_id, opponent_id, result))
      }
    } else {
      # This is the second line so get the player's rating
      player_rating <- unlist(str_extract_all(split_line[2], "[[:digit:]]+"))
      player_rating <- as.numeric(player_rating[2])
      # Get the player's state
      player_state <- str_extract(split_line[1], "[[:alpha:]]+")
      # Pull all the extracted player data together into a list
      player_data <- list("player_id" = player_id,
                          "players_name" = player_name,
                          "players_state" = player_state,
                          "player_total_points" = player_total_points,
                          "players_pre_rating" = player_rating)
      # Turn that list into a dataframe with one row
      player_row <- data.frame(bind_rows(player_data))
      # Add it to the players dataframe
      if (!exists("player_df")){
        player_df <- player_row
      } else {
        player_df <- rbind(player_df, player_row)
      }
    }
  }
}
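The loop takes the second run of digits on the rating line as the pre-rating. The sketch below checks that this still holds when a rating carries a provisional suffix such as `P17`. The sample string is a hypothetical line in that format, and base R's `regmatches` stands in for `str_extract_all` so the snippet runs without extra packages.

```r
# Hypothetical second-column text for a player with a provisional rating
rating_field <- " 14598900 / R: 1641P17->1657P24 "

# Base R stand-in for str_extract_all(rating_field, "[[:digit:]]+")
digit_runs <- regmatches(rating_field,
                         gregexpr("[[:digit:]]+", rating_field))[[1]]

# The runs are: USCF id, pre-rating, provisional game count,
# post-rating, provisional game count
pre_rating <- as.numeric(digit_runs[2])
print(pre_rating)
## [1] 1641
```

Because the USCF id always comes first and the pre-rating directly follows `R:`, taking the second run appears safe for both plain and provisional ratings.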
Now that the data is extracted we need to take a look at it. Here’s what the extracted player data looks like:
head(player_df)
## player_id players_name players_state player_total_points
## 1 1 Gary Hua ON 6.0
## 2 2 Dakshesh Daruri MI 6.0
## 3 3 Aditya Bajaj MI 6.0
## 4 4 Patrick H Schilling MI 5.5
## 5 5 Hanshi Zuo MI 5.5
## 6 6 Hansen Song OH 5.0
## players_pre_rating
## 1 1794
## 2 1553
## 3 1384
## 4 1716
## 5 1655
## 6 1686
We also have the match data:
matches[1:6,]
## [,1] [,2] [,3]
## [1,] "1" "39" "W"
## [2,] "1" "21" "W"
## [3,] "1" "18" "W"
## [4,] "1" "14" "W"
## [5,] "1" "7" "W"
## [6,] "1" "12" "D"
Now that we have extracted the data we move into the post-processing stage. First we need to turn the match results data into a dataframe, rename the variables and add in each opponent’s pre-tournament rating.
# Take the match results
matches <- matches %>%
  data.frame() %>%
  rename(player_id = X1,
         opponent = X2,
         results = X3)
# Add in the player's initial rating
matches <- player_df %>%
  rename(opponent = player_id) %>%
  select(opponent, players_pre_rating) %>%
  merge(matches)
Here’s what our match results dataframe looks like:
head(matches)
## opponent players_pre_rating player_id results
## 1 1 1794 7 L
## 2 1 1794 39 L
## 3 1 1794 18 L
## 4 1 1794 12 D
## 5 1 1794 4 D
## 6 1 1794 21 L
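One property of this join is worth illustrating: `merge()` performs an inner join by default, so match rows with no opponent id (for example, a round that was not played) drop out and do not affect the average. The sketch below uses small made-up data frames (`ratings`, `toy_matches`), not the tournament data.

```r
# Toy ratings keyed by opponent id (hypothetical values)
ratings <- data.frame(opponent = c("1", "2"),
                      players_pre_rating = c(1794, 1553),
                      stringsAsFactors = FALSE)

# Toy match rows for player 3: two games plus one round with no opponent
toy_matches <- data.frame(player_id = c("3", "3", "3"),
                          opponent = c("1", "2", NA),
                          results = c("L", "D", "B"),
                          stringsAsFactors = FALSE)

# merge() joins on the shared column "opponent" and keeps only matching rows
joined <- merge(ratings, toy_matches)
print(nrow(joined))
## [1] 2
```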
Now we need to compute the averages. We need to group the data by player id and compute an average using a dplyr pipeline.
averages <- matches %>%
  group_by(player_id) %>%
  summarise(average = mean(players_pre_rating))
Here’s what the averages look like:
head(averages)
## # A tibble: 6 x 2
## player_id average
## <fct> <dbl>
## 1 1 1605.
## 2 10 1554.
## 3 11 1468.
## 4 12 1506.
## 5 13 1498.
## 6 14 1515
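The ids in the output above run 1, 10, 11, ... because `player_id` is stored as a factor of strings and is ordered as text. If numeric ordering is wanted, a sketch like the following converts the ids safely. The vector here is a made-up example; note that calling `as.integer()` directly on a factor returns the internal level codes, not the ids.

```r
# Hypothetical ids stored as a factor, as in the tibble above
player_id <- factor(c("1", "2", "10", "11"))

# Levels are ordered as text, not as numbers
print(levels(player_id))

# Convert via character first to recover the actual id values
numeric_ids <- as.integer(as.character(player_id))
print(sort(numeric_ids))
## [1]  1  2 10 11
```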
Now we need to merge the averages into the player dataframe and drop the player id so the result matches what the description asks for. The averages are doubles rather than integers like the example, so we convert them with as.integer, which truncates the decimal part rather than rounding (1605.29 becomes 1605).
df <- merge(player_df, averages) %>%
  mutate(average_opponent_pre_rating = as.integer(average)) %>%
  select(-player_id, -average)
Here’s what the dataframe looks like:
head(df)
## players_name players_state player_total_points
## 1 Gary Hua ON 6.0
## 2 Anvit Rao MI 5.0
## 3 Cameron William Mc Leman MI 4.5
## 4 Kenneth J Tack MI 4.5
## 5 Torrance Henry Jr MI 4.5
## 6 Bradley Shaw MI 4.5
## players_pre_rating average_opponent_pre_rating
## 1 1794 1605
## 2 1365 1554
## 3 1712 1467
## 4 1663 1506
## 5 1666 1497
## 6 1610 1515
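Because `as.integer()` drops the fractional part, some averages end up one point lower than they would with rounding. The sketch below contrasts the two on a made-up value (not taken from the data), in case nearest-integer output is preferred.

```r
# Hypothetical average with a fractional part above one half
average <- 1467.6

# as.integer() truncates toward zero
truncated <- as.integer(average)

# round() first, then convert, to get the nearest integer
rounded <- as.integer(round(average))

print(c(truncated = truncated, rounded = rounded))
## truncated   rounded 
##      1467      1468 
```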
Now that the player data is in the form specified we need to save it as a CSV.
write_csv(df, 'data/chess.csv')
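The write will fail if the target directory does not exist, so it can help to create it first and read the file back as a sanity check. The sketch below is an illustration under assumptions: it writes a one-row toy data frame to a temporary directory, and uses base R's `write.csv` and `read.csv` as stand-ins for readr so it runs without extra packages.

```r
# Use a temporary location so the sketch does not touch the project folder
out_dir <- file.path(tempdir(), "data")
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# One-row toy data frame with the same columns as the final output
toy <- data.frame(players_name = "Gary Hua",
                  players_state = "ON",
                  player_total_points = 6.0,
                  players_pre_rating = 1794,
                  average_opponent_pre_rating = 1605)

# Write without row names so only the five requested columns are saved
out_file <- file.path(out_dir, "chess.csv")
write.csv(toy, out_file, row.names = FALSE)

# Read the file back to confirm the columns survived the round trip
round_trip <- read.csv(out_file)
print(names(round_trip))
```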
I really enjoy graph theory and couldn’t pass up the opportunity to create a network graph from this chess data. I recently learned how to use the igraph library from DataCamp so I will practice my skills on this data. This is beyond the scope of the project and is just for fun.
I want the graph to use the players’ names as the vertex ids.
library(igraph)
players_names_1 <- player_df %>%
  select(player_id, players_name)
players_names_2 <- player_df %>%
  rename(opponent = player_id,
         opponents_name = players_name) %>%
  select(opponent, opponents_name)
G <- matches %>%
  select(player_id, opponent) %>%
  merge(players_names_1) %>%
  merge(players_names_2) %>%
  select(players_name, opponents_name) %>%
  graph_from_data_frame(., directed = FALSE) %>%
  simplify()
Now that we have loaded the data into a graph, let’s see who Gary Hua played:
neighbors(G, 'Gary Hua')
## + 7/64 vertices, named, from 75df4ad:
## [1] Gary Dee Swathell Kenneth J Tack Patrick H Schilling
## [4] Bradley Shaw Joel R Hendon David Sundeen
## [7] Dinh Dang Bui
One last thing. I want to see if any of the people who played Gary Hua played each other. I want to visualize it.
G_sub <- induced_subgraph(G, neighbors(G, 'Gary Hua'))
layout <- layout_as_tree(G_sub, circular = TRUE)
plot(G_sub, layout=layout, vertex.size=20, vertex.color='white', vertex.frame.color = 'cornflower blue')
It looks like Joel R Hendon and Dinh Dang Bui were the only two who played each other and Gary Hua.
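Conceptually, the induced subgraph keeps only the edges whose two endpoints are both neighbors of the chosen player. The same idea can be sketched in base R on a plain edge list. The names `A` to `D` and the `edges` data frame are made up for illustration and are not the tournament data.

```r
# Toy undirected edge list (hypothetical players)
edges <- data.frame(from = c("A", "A", "A", "B"),
                    to   = c("B", "C", "D", "C"),
                    stringsAsFactors = FALSE)
focus <- "A"

# Everyone who shares an edge with the focus player
nbrs <- unique(c(edges$to[edges$from == focus],
                 edges$from[edges$to == focus]))

# Keep edges whose endpoints are both neighbors of the focus player
among <- edges[edges$from %in% nbrs & edges$to %in% nbrs, ]
print(among)
```

Here only the B and C pair remains, which is the toy analogue of two of Gary Hua's opponents having also played each other.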