Overview

In this project, we will parse information out of a text file containing the results of a chess tournament. This file is read in as a formatted text file.

To keep this interesting, we’ll parameterize as much as we can and make this as recyclable as possible and be able to run this as an ongoing data pipeline.

1. Reading the text file as a string

As we’ve saved this one sample file to GitHub we will be able to access this file directly through the url:

file_url <- "https://raw.githubusercontent.com/riverar9/cuny-msds/main/data607/projects/project-1/tournamentinfo.txt" # nolint

tournament_string <- readLines(file_url)

2. Creating functions to extract tournament information

This first function is very simple, its purposes is to essentially perform a str_extract_all but ignore any empty strings. This will return an array of all the matches and nothing else.

parse_pattern_with_regex <- function(string, pattern) {
  # Import the string r library
  library(stringr)

  # create an array with each match and then trim each match
  matches <- str_extract_all(string, pattern, simplify = TRUE)
  matches <- trimws(matches)

  # Initialize an empty character vector
  output <- c()

  # Iterate through each match, appending only non-empty strings.
  for (element in matches) {
    if (element != "") {
      output <- c(output, element)
    }
  }

  return(output)
}

These next two functions both take in a tournament string and parse information.

The results of these functions are:

  1. parse_player_data = Returns a dataframe of each player and their information.
  2. parse_match_results = Returns a dataframe of matches and their result by the player who participated in the match.

In order to extract the information, we will use regex. By using regex, we can identify patterns in the original text document and use these patterns to our advantage by creating flags and matches that regex will use to extract our information.

The patterns for these are:

  1. Player Id - this pattern is any number prefixed by a new line character and any number of spaces while being suffixed by one space and a “|” character.
  2. Name - this pattern is an all capital string with any number of spaces and any number of hyphens that is prefixed by a “|” and suffixed by a “|” character and a digit character.
  3. USCF Id - This pattern is any 8 digit number.
  4. State - this pattern is any string of length 2 that is prefixed by a new line character and any number of spaces while being suffixed by one space and a “|” character.
  5. Points - this pattern is any string consisting of a digit and “.” character prefixed by a “|” and suffixed by any number of spaces followed by a “|” character.
  6. Pre-Rating - this pattern is defined as any number of digits prefixed by a “:” character followed by exactly one space and suffixed by either a space or an alphabet character.
  7. Post-Rating - this pattern is defined as a number of at least 1 digit prefixed by a “>” character and any number of space characters.
  8. Round Results - this pattern will first extract everything with the prefix being . being a “.” character followed by a number and exactly 2 spaces. Then it will extract everything until the following new line character.
parse_player_data <- function(ts) {
  library(tidyr)
  library(magrittr)

  # Establish the patterns for the data we want to collect
  id_pattern          <- r"(^\s*[0-9]+\s(?=\|))"
  name_pattern        <- r"((?<=\|)([A-Z\s\-]+)(?=\|\d))"
  uscfid_pattern      <- r"(\d{8})"
  state_pattern       <- r"(^\s*([A-Z]{2})\s(?=\|))"
  points_pattern      <- r"((?<=\|)[\d\.]+\s*(?=\|))"
  prerating_pattern   <- r"((?<=(\:))\s+(\d+)(?=\s|[A-Z]))"
  postrating_pattern  <- r"((?<=\>)\s*(\d+))"

  # Create a list of vectors for the player data parsed
  player_data_list <- list(
    player_id = parse_pattern_with_regex(ts, id_pattern),
    player_name = parse_pattern_with_regex(ts, name_pattern),
    player_uscfid = parse_pattern_with_regex(ts, uscfid_pattern),
    player_state = parse_pattern_with_regex(ts, state_pattern),
    player_points = parse_pattern_with_regex(ts, points_pattern),
    player_prerating = parse_pattern_with_regex(ts, prerating_pattern),
    player_postrating = parse_pattern_with_regex(ts, postrating_pattern)
  )

  # Convert the player information into a dataframe
  player_df <- as.data.frame(player_data_list)

  # Convert player_id, player_uscfid, player_points, player_prerating,
  # and player_postrating to numeric
  player_df$player_id <- as.numeric(
    player_df$player_id
  )

  player_df$player_uscfid <- as.numeric(
    player_df$player_uscfid
  )

  player_df$player_points <- as.numeric(
    player_df$player_points
  )

  player_df$player_prerating <- as.numeric(
    player_df$player_prerating
  )

  player_df$player_postrating <- as.numeric(
    player_df$player_postrating
  )

  return(player_df)
}

parse_match_results <- function(ts) {
  library(tidyr)
  library(magrittr)

  # Establish the patterns for the data we want to collect
  id_pattern          <- r"(^\s*[0-9]+\s(?=\|))"
  results_pattern     <- r"((?<=\.\d\s{2}\|).*(?=\|))"

  # Obtain the match data
  player_match_data_list <- list(
    player_id = parse_pattern_with_regex(ts, id_pattern),
    match_results = parse_pattern_with_regex(ts, results_pattern)
  )

  # Convert the matches data into a dataframe
  matches_df <- as.data.frame(player_match_data_list)

  # Working with the matches dataframe, we will need to transform the data so
  # that each row represents a player_id, an opponent_player_id, and the game result
  round_numbers <- c(
    "1",
    "2",
    "3",
    "4",
    "5",
    "6",
    "7"
  )

  matches_df %<>%
    separate_wider_delim(
      cols = match_results,
      delim = "|",
      names = round_numbers
    ) %<>%
    pivot_longer(
      !player_id,
      names_to = "round_number",
      values_to = "match_result"
    ) %<>%
    separate_wider_delim(
      cols = match_result,
      delim = " ",
      names = c("round_outcome", "opponent_player_id"),
      too_few = "align_start",
      too_many = "merge"
    )

  # Convert the opponent_player_id, player_id, and round_number
  # columns to numeric
  matches_df$opponent_player_id <- as.numeric(
    matches_df$opponent_player_id
  )

  matches_df$player_id <- as.numeric(
    matches_df$player_id
  )

  matches_df$round_number <- as.numeric(
    matches_df$round_number
  )

  return(matches_df)
}

3. Getting Tournament information

With the functions defined, we will simply use them to create our two dataframes.

players_df <- parse_player_data(tournament_string)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
## 
##     extract
matches_df <- parse_match_results(tournament_string)

4. Normalize the data into a Relational Database Model

While building this function, I learned about using non-standard evaluations. This learning has enabled me to turn some of the common steps I take into something replicable.

This is great for accessing column names that are stored as a variable. To do this, we will simply need to use !!sym([variable_name]) or in the case of the mutate function from below !!as.character([temp_id_name]).

This function may be a bit overkill for this project but will likely come in handy in the future.

normalize_datasets <- function(input_df, fields_to_normalize) { # nolint: line_length_linter.
  library(dplyr)

  # Use setdiff to find which column names do not exist in the
  # dataframe's columns
  missing_columns <- setdiff(fields_to_normalize, names(input_df))

  # If there are missing columns, stop everything and warn the user.
  if (length(missing_columns) > 0) {
    stop(
      paste(
        "Columns not found in this dataframe:\n\t",
        paste(missing_columns, collapse = "\n\t")
      )
    )
  }

  # Create a collection variable to store outputs
  dataset_colleciton <- list()

  # Iterate through the columns and create the reference dataframes
  for (each_field in fields_to_normalize) {
    # intialize an id field name
    temp_id_name <- paste(c(each_field, "_id"), collapse = "") # nolint

    # Create the reference table
    temp_ref_df <- input_df |>
      distinct(!!sym(each_field)) |>
      mutate(!!as.character(temp_id_name) := row_number())

    # perform this mapping onto input_df and mutate it to retain the id only
    input_df <- subset(
      merge(input_df, temp_ref_df, by = each_field)
      , select = (colnames(input_df)[colnames(input_df) != each_field])
    )

    # Add this reference to the dataset_collection
    dataset_colleciton[[each_field]] <- temp_ref_df

  }

  # With each column normalized, we will return the list with the
  # modified input_df
  dataset_colleciton[["input_result"]] <- input_df

  return(dataset_colleciton)
}

With the function built, let’s inspect our datasets for which fields we should use for our model:

str(players_df)
## 'data.frame':    64 obs. of  7 variables:
##  $ player_id        : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ player_name      : chr  "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ" "PATRICK H SCHILLING" ...
##  $ player_uscfid    : num  15445895 14598900 14959604 12616049 14601533 ...
##  $ player_state     : chr  "ON" "MI" "MI" "MI" ...
##  $ player_points    : num  6 6 6 5.5 5.5 5 5 5 5 5 ...
##  $ player_prerating : num  1794 1553 1384 1716 1655 ...
##  $ player_postrating: num  1817 1663 1640 1744 1690 ...
str(matches_df)
## tibble [448 × 4] (S3: tbl_df/tbl/data.frame)
##  $ player_id         : num [1:448] 1 1 1 1 1 1 1 2 2 2 ...
##  $ round_number      : num [1:448] 1 2 3 4 5 6 7 1 2 3 ...
##  $ round_outcome     : chr [1:448] "W" "W" "W" "W" ...
##  $ opponent_player_id: num [1:448] 39 21 18 14 7 12 4 63 58 4 ...

From inspecting the results of both of these, the candidates to use are:

  1. player_state
  2. round_outcome

A candidate was determined by looking at the column and seeing which ones have a one to many relationship. By looking at the str(df) results, we can see that from just the sample provided there are already a few which show up (player_state and round_outcome)

Not very many entries to be reduced. Honestly, this was not even worth reducing as it’s unecessary added complexity. So, at this point I decided to scrap the idea and just decided to continue in a simpler fashion. For the sake of completion though, I’ll include an appendix section where I go through this process.

5. Calculate the opponent average score

Thankfully, with the work we’ve done by creating matches_df, we should be able to easily obtain this by joining matches_df to players_df where opponent_player_id = player_id. With that result, we should be able to simply group by the player_id and average the opponent players’ preratings.

A question that comes up is what to do when the player had a round with no opponent, for example, a bye round. For this case, I believe it makes the most sense to not include anything for that round. I chose this method because the metric we are creating is their average opponent’s score and adding a 0 will imply that they faced an opponent with a score of 0. Luckily, by using the merge command (an inner join), we will take care of this immediately and remove any matches with a NA opponent_player_id.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
pre_rating_df <- players_df %>%
  select(player_id, player_prerating)

prerating_matches_df <- merge(
  matches_df,
  pre_rating_df,
  by.x = "opponent_player_id",
  by.y = "player_id"
)

average_opp_rating <- prerating_matches_df |>
  group_by(player_id) |>
  summarise(average_opponent_prerating = mean(player_prerating))

average_opp_rating |>
  filter(player_id == 1)
## # A tibble: 1 × 2
##   player_id average_opponent_prerating
##       <dbl>                      <dbl>
## 1         1                      1605.

Verifying with the pdf provided for this project, we can see that for Gary Hua (player_id = 1), we are calculating the “Average Pre Chess Rating of Opponents” correctly.

6. Creating the final dataset and writing the data locally

Now that we have everything, we can go ahead and create the dataset we need and save it. The dataset we need is:

  1. player_name
  2. player_state
  3. player_points
  4. player_prerating
  5. average_opponent_prerating

In order to get this information, we will need to perform a merge on the players_df and the average_opp_rating dataframes and then select the columns mentioned above:

final_df <- merge(players_df, average_opp_rating, by = "player_id") |>
  select(
    player_name,
    player_state,
    player_points,
    player_prerating,
    average_opponent_prerating
  )

head(final_df)
##           player_name player_state player_points player_prerating
## 1            GARY HUA           ON           6.0             1794
## 2     DAKSHESH DARURI           MI           6.0             1553
## 3        ADITYA BAJAJ           MI           6.0             1384
## 4 PATRICK H SCHILLING           MI           5.5             1716
## 5          HANSHI ZUO           MI           5.5             1655
## 6         HANSEN SONG           OH           5.0             1686
##   average_opponent_prerating
## 1                   1605.286
## 2                   1469.286
## 3                   1563.571
## 4                   1573.571
## 5                   1500.857
## 6                   1518.714

Now let’s write final_df to a csv locally:

write.csv(final_df, "players_average_opponent_prerating.csv", row.names = FALSE)

Conclusion

In this project, we were given a text file which held the information of chess players and the results of up to 7 rounds of playing. With the text file, we’ve developed two functions to parse this data out using regex and create two dataframes, one for players and their information and another for the rounds that was played. With this, I invesitaged to see if this data could benefit from breaking it down into more reference tables but it turned out that it there wasn’t much available although I fully intend on reusing the function in future projects.

Lastly, we calculated the average prescore of a player’s opponents and wrote that out along with the match data and the players data.

Appendix

As promised and for completion, here is the appendix where I go through the columns identified in section 4.

players_datasets <- normalize_datasets(players_df, c("player_state"))

state_ref         <- players_datasets$player_state
res_players_df    <- players_datasets$input_result

state_ref
##   player_state player_state_id
## 1           ON               1
## 2           MI               2
## 3           OH               3

Now we can see 3 states here. Perhaps it was a local tournament?

matches_datasets  <- normalize_datasets(matches_df, c("round_outcome"))

res_matches_df    <- matches_datasets$input_result
outcome_ref       <- matches_datasets$round_outcome

outcome_ref
## # A tibble: 7 × 2
##   round_outcome round_outcome_id
##   <chr>                    <int>
## 1 W                            1
## 2 D                            2
## 3 L                            3
## 4 H                            4
## 5 U                            5
## 6 B                            6
## 7 X                            7

Here we can easily see that there were 7 types of round_outcomes. Although I don’t know what they all mean, we can see that the tables are pretty normalized now. That is to say, that we have an ID for any one to many relationships.

This will be helpful in the future to do the same and create relational database models.