In this project, we will parse information out of a text file containing the results of a chess tournament. This file is read in as a formatted text file.
To keep this interesting, we’ll parameterize as much as we can and make this as recyclable as possible and be able to run this as an ongoing data pipeline.
As we’ve saved this one sample file to GitHub we will be able to access this file directly through the url:
file_url <- "https://raw.githubusercontent.com/riverar9/cuny-msds/main/data607/projects/project-1/tournamentinfo.txt" # nolint
tournament_string <- readLines(file_url)
This first function is very simple, its purposes is to essentially perform a str_extract_all but ignore any empty strings. This will return an array of all the matches and nothing else.
parse_pattern_with_regex <- function(string, pattern) {
# Import the string r library
library(stringr)
# create an array with each match and then trim each match
matches <- str_extract_all(string, pattern, simplify = TRUE)
matches <- trimws(matches)
# Initialize an empty character vector
output <- c()
# Iterate through each match, appending only non-empty strings.
for (element in matches) {
if (element != "") {
output <- c(output, element)
}
}
return(output)
}
These next two functions both take in a tournament string and parse information.
The results of these functions are:
In order to extract the information, we will use regex. By using regex, we can identify patterns in the original text document and use these patterns to our advantage by creating flags and matches that regex will use to extract our information.
The patterns for these are:
parse_player_data <- function(ts) {
library(tidyr)
library(magrittr)
# Establish the patterns for the data we want to collect
id_pattern <- r"(^\s*[0-9]+\s(?=\|))"
name_pattern <- r"((?<=\|)([A-Z\s\-]+)(?=\|\d))"
uscfid_pattern <- r"(\d{8})"
state_pattern <- r"(^\s*([A-Z]{2})\s(?=\|))"
points_pattern <- r"((?<=\|)[\d\.]+\s*(?=\|))"
prerating_pattern <- r"((?<=(\:))\s+(\d+)(?=\s|[A-Z]))"
postrating_pattern <- r"((?<=\>)\s*(\d+))"
# Create a list of vectors for the player data parsed
player_data_list <- list(
player_id = parse_pattern_with_regex(ts, id_pattern),
player_name = parse_pattern_with_regex(ts, name_pattern),
player_uscfid = parse_pattern_with_regex(ts, uscfid_pattern),
player_state = parse_pattern_with_regex(ts, state_pattern),
player_points = parse_pattern_with_regex(ts, points_pattern),
player_prerating = parse_pattern_with_regex(ts, prerating_pattern),
player_postrating = parse_pattern_with_regex(ts, postrating_pattern)
)
# Convert the player information into a dataframe
player_df <- as.data.frame(player_data_list)
# Convert player_id, player_uscfid, player_points, player_prerating,
# and player_postrating to numeric
player_df$player_id <- as.numeric(
player_df$player_id
)
player_df$player_uscfid <- as.numeric(
player_df$player_uscfid
)
player_df$player_points <- as.numeric(
player_df$player_points
)
player_df$player_prerating <- as.numeric(
player_df$player_prerating
)
player_df$player_postrating <- as.numeric(
player_df$player_postrating
)
return(player_df)
}
parse_match_results <- function(ts) {
library(tidyr)
library(magrittr)
# Establish the patterns for the data we want to collect
id_pattern <- r"(^\s*[0-9]+\s(?=\|))"
results_pattern <- r"((?<=\.\d\s{2}\|).*(?=\|))"
# Obtain the match data
player_match_data_list <- list(
player_id = parse_pattern_with_regex(ts, id_pattern),
match_results = parse_pattern_with_regex(ts, results_pattern)
)
# Convert the matches data into a dataframe
matches_df <- as.data.frame(player_match_data_list)
# Working with the matches dataframe, we will need to transform the data so
# that each row represents a player_id, an opponent_player_id, and the game result
round_numbers <- c(
"1",
"2",
"3",
"4",
"5",
"6",
"7"
)
matches_df %<>%
separate_wider_delim(
cols = match_results,
delim = "|",
names = round_numbers
) %<>%
pivot_longer(
!player_id,
names_to = "round_number",
values_to = "match_result"
) %<>%
separate_wider_delim(
cols = match_result,
delim = " ",
names = c("round_outcome", "opponent_player_id"),
too_few = "align_start",
too_many = "merge"
)
# Convert the opponent_player_id, player_id, and round_number
# columns to numeric
matches_df$opponent_player_id <- as.numeric(
matches_df$opponent_player_id
)
matches_df$player_id <- as.numeric(
matches_df$player_id
)
matches_df$round_number <- as.numeric(
matches_df$round_number
)
return(matches_df)
}
With the functions defined, we will simply use them to create our two dataframes.
players_df <- parse_player_data(tournament_string)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
matches_df <- parse_match_results(tournament_string)
While building this function, I learned about using non-standard evaluations. This learning has enabled me to turn some of the common steps I take into something replicable.
This is great for accessing column names that are stored as a variable. To do this, we will simply need to use !!sym([variable_name]) or in the case of the mutate function from below !!as.character([temp_id_name]).
This function may be a bit overkill for this project but will likely come in handy in the future.
normalize_datasets <- function(input_df, fields_to_normalize) { # nolint: line_length_linter.
library(dplyr)
# Use setdiff to find which column names do not exist in the
# dataframe's columns
missing_columns <- setdiff(fields_to_normalize, names(input_df))
# If there are missing columns, stop everything and warn the user.
if (length(missing_columns) > 0) {
stop(
paste(
"Columns not found in this dataframe:\n\t",
paste(missing_columns, collapse = "\n\t")
)
)
}
# Create a collection variable to store outputs
dataset_colleciton <- list()
# Iterate through the columns and create the reference dataframes
for (each_field in fields_to_normalize) {
# intialize an id field name
temp_id_name <- paste(c(each_field, "_id"), collapse = "") # nolint
# Create the reference table
temp_ref_df <- input_df |>
distinct(!!sym(each_field)) |>
mutate(!!as.character(temp_id_name) := row_number())
# perform this mapping onto input_df and mutate it to retain the id only
input_df <- subset(
merge(input_df, temp_ref_df, by = each_field)
, select = (colnames(input_df)[colnames(input_df) != each_field])
)
# Add this reference to the dataset_collection
dataset_colleciton[[each_field]] <- temp_ref_df
}
# With each column normalized, we will return the list with the
# modified input_df
dataset_colleciton[["input_result"]] <- input_df
return(dataset_colleciton)
}
With the function built, let’s inspect our datasets for which fields we should use for our model:
str(players_df)
## 'data.frame': 64 obs. of 7 variables:
## $ player_id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ player_name : chr "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ" "PATRICK H SCHILLING" ...
## $ player_uscfid : num 15445895 14598900 14959604 12616049 14601533 ...
## $ player_state : chr "ON" "MI" "MI" "MI" ...
## $ player_points : num 6 6 6 5.5 5.5 5 5 5 5 5 ...
## $ player_prerating : num 1794 1553 1384 1716 1655 ...
## $ player_postrating: num 1817 1663 1640 1744 1690 ...
str(matches_df)
## tibble [448 × 4] (S3: tbl_df/tbl/data.frame)
## $ player_id : num [1:448] 1 1 1 1 1 1 1 2 2 2 ...
## $ round_number : num [1:448] 1 2 3 4 5 6 7 1 2 3 ...
## $ round_outcome : chr [1:448] "W" "W" "W" "W" ...
## $ opponent_player_id: num [1:448] 39 21 18 14 7 12 4 63 58 4 ...
From inspecting the results of both of these, the candidates to use are:
A candidate was determined by looking at the column and seeing which ones have a one to many relationship. By looking at the str(df) results, we can see that from just the sample provided there are already a few which show up (player_state and round_outcome)
Not very many entries to be reduced. Honestly, this was not even worth reducing as it’s unecessary added complexity. So, at this point I decided to scrap the idea and just decided to continue in a simpler fashion. For the sake of completion though, I’ll include an appendix section where I go through this process.
Thankfully, with the work we’ve done by creating matches_df, we should be able to easily obtain this by joining matches_df to players_df where opponent_player_id = player_id. With that result, we should be able to simply group by the player_id and average the opponent players’ preratings.
A question that comes up is what to do when the player had a round with no opponent, for example, a bye round. For this case, I believe it makes the most sense to not include anything for that round. I chose this method because the metric we are creating is their average opponent’s score and adding a 0 will imply that they faced an opponent with a score of 0. Luckily, by using the merge command (an inner join), we will take care of this immediately and remove any matches with a NA opponent_player_id.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
pre_rating_df <- players_df %>%
select(player_id, player_prerating)
prerating_matches_df <- merge(
matches_df,
pre_rating_df,
by.x = "opponent_player_id",
by.y = "player_id"
)
average_opp_rating <- prerating_matches_df |>
group_by(player_id) |>
summarise(average_opponent_prerating = mean(player_prerating))
average_opp_rating |>
filter(player_id == 1)
## # A tibble: 1 × 2
## player_id average_opponent_prerating
## <dbl> <dbl>
## 1 1 1605.
Verifying with the pdf provided for this project, we can see that for Gary Hua (player_id = 1), we are calculating the “Average Pre Chess Rating of Opponents” correctly.
Now that we have everything, we can go ahead and create the dataset we need and save it. The dataset we need is:
In order to get this information, we will need to perform a merge on the players_df and the average_opp_rating dataframes and then select the columns mentioned above:
final_df <- merge(players_df, average_opp_rating, by = "player_id") |>
select(
player_name,
player_state,
player_points,
player_prerating,
average_opponent_prerating
)
head(final_df)
## player_name player_state player_points player_prerating
## 1 GARY HUA ON 6.0 1794
## 2 DAKSHESH DARURI MI 6.0 1553
## 3 ADITYA BAJAJ MI 6.0 1384
## 4 PATRICK H SCHILLING MI 5.5 1716
## 5 HANSHI ZUO MI 5.5 1655
## 6 HANSEN SONG OH 5.0 1686
## average_opponent_prerating
## 1 1605.286
## 2 1469.286
## 3 1563.571
## 4 1573.571
## 5 1500.857
## 6 1518.714
Now let’s write final_df to a csv locally:
write.csv(final_df, "players_average_opponent_prerating.csv", row.names = FALSE)
In this project, we were given a text file which held the information of chess players and the results of up to 7 rounds of playing. With the text file, we’ve developed two functions to parse this data out using regex and create two dataframes, one for players and their information and another for the rounds that was played. With this, I invesitaged to see if this data could benefit from breaking it down into more reference tables but it turned out that it there wasn’t much available although I fully intend on reusing the function in future projects.
Lastly, we calculated the average prescore of a player’s opponents and wrote that out along with the match data and the players data.
As promised and for completion, here is the appendix where I go through the columns identified in section 4.
players_datasets <- normalize_datasets(players_df, c("player_state"))
state_ref <- players_datasets$player_state
res_players_df <- players_datasets$input_result
state_ref
## player_state player_state_id
## 1 ON 1
## 2 MI 2
## 3 OH 3
Now we can see 3 states here. Perhaps it was a local tournament?
matches_datasets <- normalize_datasets(matches_df, c("round_outcome"))
res_matches_df <- matches_datasets$input_result
outcome_ref <- matches_datasets$round_outcome
outcome_ref
## # A tibble: 7 × 2
## round_outcome round_outcome_id
## <chr> <int>
## 1 W 1
## 2 D 2
## 3 L 3
## 4 H 4
## 5 U 5
## 6 B 6
## 7 X 7
Here we can easily see that there were 7 types of round_outcomes. Although I don’t know what they all mean, we can see that the tables are pretty normalized now. That is to say, that we have an ID for any one to many relationships.
This will be helpful in the future to do the same and create relational database models.