Project1v2

Overview

The purpose of this document is to systematically import a text file containing data from a chess tournament cross table. The below R code uses string manipulations and extractions in order to clean up the text file and extract relevant information formatted in a clean, reusable, and exported dataframe with the goal of writing the newly created table to a CSV file for easy use.

library(stringr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Import and Clean

First you’ll need to import the raw data from my github repo. You can find a link to the data here: <https://raw.githubusercontent.com/awrubes/Project-1-v2/main/tournamentinfo.txt>. Instead of reading the file directly into a structured format, the data is read in as a string vector to allow for specific string manipulation. This allows us to parse and clean the data in subsequent steps.

#import chess data as txt file
chess_data <- suppressWarnings(readLines("https://raw.githubusercontent.com/awrubes/Project-1-v2/main/tournamentinfo.txt"
))
head(chess_data)

## [1] "-----------------------------------------------------------------------------------------" 
## [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [4] "-----------------------------------------------------------------------------------------" 
## [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

chess_df <-tibble(raw=chess_data)

head(chess_df)

## # A tibble: 6 × 1
##   raw                                                                           
##   <chr>                                                                         
## 1 "----------------------------------------------------------------------------…
## 2 " Pair | Player Name                     |Total|Round|Round|Round|Round|Round…
## 3 " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  …
## 4 "----------------------------------------------------------------------------…
## 5 "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7…
## 6 "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    …

Next, we’ll want to clean up the text file so that it’s easier to parse. We’ll do this by getting rid of the header rows and removing the “—” row separators. This way the only text in the file will be the data we want to selectively extract.

chess_df_clean <- chess_df %>%
  filter(!grepl("^-+$", raw),
          !grepl("^\\s*(Pair|Num)", raw))


head(chess_df_clean)

## # A tibble: 6 × 1
##   raw                                                                           
##   <chr>                                                                         
## 1 "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7…
## 2 "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    …
## 3 "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16…
## 4 "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    …
## 5 "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11…
## 6 "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    …

Once the unnecessary rows are removed, we proceed to further structure the dataset. Each player’s data spans two rows—one for their general tournament information and another for their rating information. We separate these rows using the mutate() and pivot_wider() functions, which allow us to organize the data into separate columns for easier extraction.

chess_df_clean <-chess_df_clean %>%
  mutate(row_num = row_number()) %>%
  mutate(info_type = ifelse(row_num %%2 == 1, "player_info", "player_rating_info")) %>%
  group_by(info_type) %>%
    mutate(group = row_number())%>%
    pivot_wider(names_from = info_type, values_from = raw) %>%
    select(-group)
  
chess_df_clean <- chess_df_clean %>%
  mutate(
    player_rating_info = ifelse(is.na(player_rating_info), lead(player_rating_info), player_info)
       )%>%
  filter(!is.na(player_info), !is.na(player_rating_info))
  
head(chess_df_clean)

## # A tibble: 6 × 3
##   row_num player_info                                         player_rating_info
##     <int> <chr>                                               <chr>             
## 1       1 "    1 | GARY HUA                        |6.0  |W … "   ON | 15445895…
## 2       3 "    2 | DAKSHESH DARURI                 |6.0  |W … "   MI | 14598900…
## 3       5 "    3 | ADITYA BAJAJ                    |6.0  |L … "   MI | 14959604…
## 4       7 "    4 | PATRICK H SCHILLING             |5.5  |W … "   MI | 12616049…
## 5       9 "    5 | HANSHI ZUO                      |5.5  |W … "   MI | 14601533…
## 6      11 "    6 | HANSEN SONG                     |5.0  |W … "   OH | 15055204…

Extracting Data

With the data structured, we now extract specific columns that contain player names, state information, total points, pre-tournament ratings, and games played. String extraction functions (str_extract) are applied to extract relevant data from each row. The mutate() function is used to create new columns that hold this information.

We also calculate the average rating of opponents each player faced, using a combination of lapply() to extract the IDs of opponents and sapply() to compute their average ratings based on the opponent ratings in the data set.

chess_df_clean <- chess_df_clean %>%
  mutate(player_name = str_extract(player_info, "(?<=\\|)\\s*[A-Z\\s]+(?=\\s*\\|)"))%>%
  mutate(state = trimws(str_extract(player_rating_info, "^\\s*[A-Z]{2}\\b")))%>%
  mutate(total_points = as.numeric(str_extract(player_info, "(?<=|)\\d\\.\\d")))%>%
  mutate(pre_rating = as.numeric(str_extract(player_rating_info, "(?<=R:)\\s*(\\d+)(?=\\s*->|\\w)")))%>%
  mutate(games_played = as.numeric(str_count(player_info, "\\b(W|L|D){1}\\s*\\d+")))%>%
  mutate(
    opponent_ids = lapply(player_info, function(info){
      as.numeric(trimws(unlist(str_extract_all(info, "(?<=\\b[WLD])\\s*\\d+"))))
    })
  )%>%
  mutate(
    avg_op_rating=sapply(opponent_ids, function(ids){
      
      opponent_ratings <- .data$pre_rating[ids]
      
      total_opponent_ratings <- sum(opponent_ratings, na.rm = TRUE)
      
      avg_op_rating <- total_opponent_ratings / length(ids)
      return(avg_op_rating)
      
    })
  )

head(chess_df_clean)

## # A tibble: 6 × 10
##   row_num player_info          player_rating_info player_name state total_points
##     <int> <chr>                <chr>              <chr>       <chr>        <dbl>
## 1       1 "    1 | GARY HUA  … "   ON | 15445895… " GARY HUA… ON             6  
## 2       3 "    2 | DAKSHESH D… "   MI | 14598900… " DAKSHESH… MI             6  
## 3       5 "    3 | ADITYA BAJ… "   MI | 14959604… " ADITYA B… MI             6  
## 4       7 "    4 | PATRICK H … "   MI | 12616049… " PATRICK … MI             5.5
## 5       9 "    5 | HANSHI ZUO… "   MI | 14601533… " HANSHI Z… MI             5.5
## 6      11 "    6 | HANSEN SON… "   OH | 15055204… " HANSEN S… OH             5  
## # ℹ 4 more variables: pre_rating <dbl>, games_played <dbl>,
## #   opponent_ids <list>, avg_op_rating <dbl>

This step produces a clean data set that includes critical player statistics such as total points, pre-tournament ratings, the number of games played, and the average rating of their opponents.

Final Data and Export

After extracting the relevant information, the final step is to create a dataframe that includes only the necessary columns: player name, state, total points, pre-tournament rating, and average opponent rating. This dataframe is then exported to a CSV file for future use.

selected_columns <- chess_df_clean[, c("player_name", "state", "total_points", "pre_rating", "avg_op_rating")]

head(selected_columns)

## # A tibble: 6 × 5
##   player_name                        state total_points pre_rating avg_op_rating
##   <chr>                              <chr>        <dbl>      <dbl>         <dbl>
## 1 " GARY HUA                       … ON             6         1794         1605.
## 2 " DAKSHESH DARURI                … MI             6         1553         1469.
## 3 " ADITYA BAJAJ                   … MI             6         1384         1564.
## 4 " PATRICK H SCHILLING            … MI             5.5       1716         1574.
## 5 " HANSHI ZUO                     … MI             5.5       1655         1501.
## 6 " HANSEN SONG                    … OH             5         1686         1519.

The resulting CSV file contains a clean and reusable dataset that can be used for further analysis or reporting. The streamlined approach ensures that the data is both accurate and well-organized for future use.

write.csv(selected_columns, file = "chess_tournament_selected_columns.csv", row.names = FALSE)

Project1v2

aw

2024-09-29

Overview

Import and Clean

Extracting Data

Final Data and Export