Introduction

Given a text file of chess tournament results, I will create an R Markdown file that generates a .CSV file with the following information for every player: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Opponent ELO (the mean pre-tournament rating of the opponents the player faced).

Packages/Library Needed

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Importing and Cleaning Data

The first step is to import the .txt file into RStudio; for further practice, I uploaded the given data set to GitHub and read it from there. Since the main focus of this project is regular expressions, they are used to extract and clean nearly all of the data. read_lines() returns the file as a character vector with one messy string per line, so each desired column is pulled out with its own regular expression, as shown in the statements that follow.

Original <- read_lines("https://raw.githubusercontent.com/Jlok17/Data-Science-Projects/main/Project%201%20607.txt")
head(Original)
## [1] "-----------------------------------------------------------------------------------------" 
## [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [4] "-----------------------------------------------------------------------------------------" 
## [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
#This Regular Expression statement matches 1 or 2 consecutive digits that are preceded by 3 or 4 whitespace characters and followed by a single whitespace character.
Player_Number <- as.numeric(unlist(str_extract_all(Original,"(?<=\\s{3,4})\\d{1,2}(?=\\s)")))

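#This Regular Expression statement matches a string of text that comes after a digit, a whitespace character, a "|", and another whitespace character, and that runs up to the next "|". The match keeps the trailing padding of the name field.
#The character class uses A-Za-z rather than A-z, because A-z also covers the punctuation characters that sit between "Z" and "a" in ASCII.
Player_Name <- unlist(str_extract_all(Original,"(?<=\\d\\s\\|\\s)([A-Za-z, -]*\\s){1,}[[:alpha:]]*(?=\\s*\\|)"))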

#This Regular Expression statement matches 2 consecutive uppercase letters that are followed by a whitespace and a "|"
State <- unlist(str_extract_all(Original, "[[:upper:]]{2}(?=\\s\\|)"))

#This Regular Expression statement matches a single digit, a period, and another single digit that are preceded by "|".
Pt_Total <- as.numeric(unlist(str_extract_all(Original, "(?<=\\|)\\d\\.\\d")))

#This Regular Expression statement matches either a 3 or 4 digit number that follows "R:" and 1-2 whitespace characters and is followed by whitespace, or a 3 or 4 digit number that is followed by "P", 1-2 digits, optional whitespace, and a hyphen character.
Pre_Rating <- as.numeric(unlist(str_extract_all(Original, "(?<=R:\\s{1,2})(\\d{3,4}(?=\\s))|(\\d{3,4}(?=P\\d{1,2}\\s*-))")))

#Combining all of my Desired Data into a Data frame except Average ELO of Opponent
df <- data.frame(Player_Number, Player_Name, State, Pt_Total, Pre_Rating)
head(df)
##   Player_Number                      Player_Name State Pt_Total Pre_Rating
## 1             1 GARY HUA                            ON      6.0       1794
## 2             2 DAKSHESH DARURI                     MI      6.0       1553
## 3             3 ADITYA BAJAJ                        MI      6.0       1384
## 4             4 PATRICK H SCHILLING                 MI      5.5       1716
## 5             5 HANSHI ZUO                          MI      5.5       1655
## 6             6 HANSEN SONG                         OH      5.0       1686
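
To make the lookaround idea concrete, the sketch below applies the same kind of patterns to the two lines for the first player, copied from the head(Original) output. The variable names (player_lines, points, state, pre_rating, name) are illustrative only, and the str_split()/str_trim() step for the name is a swapped-in alternative that assumes the name is always the second "|"-separated field.

```r
library(stringr)

# The two lines for the first player, copied from the head(Original) output
player_lines <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
)

# Lookbehind (?<=\\|) requires a preceding "|" without including it in the match
points <- as.numeric(str_extract(player_lines[1], "(?<=\\|)\\d\\.\\d"))

# Lookahead (?=\\s\\|) requires a following " |" without including it in the match
state <- str_extract(player_lines[2], "[[:upper:]]{2}(?=\\s\\|)")

# The lookbehind anchors on "R:" plus 1-2 spaces, so the USCF ID is skipped
pre_rating <- as.numeric(str_extract(player_lines[2], "(?<=R:\\s{1,2})\\d{3,4}"))

# Alternative for the name (assumption: it is always the second field):
# split on "|" and trim the padding
name <- str_trim(str_split(player_lines[1], "\\|")[[1]][2])

points; state; pre_rating; name
```

The names stored in df keep their trailing padding, so a call such as str_trim(df$Player_Name) could be applied before export if clean names are wanted.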

Obtaining the Information: Average Opponent ELO

In this step, I use a regular expression to collect the player numbers of the opponents each player faced. Since there are 7 rounds, the extracted values come in groups of 7, one group per player, so values 1-7 belong to the first player, Gary Hua. We also have to account for missing opponents: in a chess tournament a player sometimes has no opponent in a round (for example because of a bye or a no-show), and that round's opponent field is blank in the results. To do the calculation, I use a for loop that computes the sum and mean of the opponents' ELO for each player.
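
How blank rounds keep the groups aligned can be sketched on a few made-up lines. The three result lines below are hypothetical (3 rounds instead of 7, invented players), and the pattern is a simplified version of the kind of regular expression used in this section: it returns the opponent number for played games and a blank string for rounds without an opponent, which as.numeric() turns into NA.

```r
library(stringr)

# Hypothetical result lines (not from the tournament file), 3 rounds each
toy_lines <- c(
  "    1 | PLAYER ONE                      |2.0  |W   2|W   3|L   2|",
  "    2 | PLAYER TWO                      |1.5  |L   1|H    |W   1|",
  "    3 | PLAYER THREE                    |0.0  |U    |L   1|U    |"
)

# Either 1-2 digits after "|", a result letter, and 2-3 spaces,
# or the 4 blanks that follow a code for a round without an opponent
pattern <- "(?<=\\|[WLD]\\s{2,3})\\d{1,2}(?=\\|)|(?<=\\|[UHBX])\\s{4}(?=\\|)"

# Blank strings become NA, so every player still contributes one value per round
toy_opponents <- as.numeric(unlist(str_extract_all(toy_lines, pattern)))

# One row per player, one column per round
toy_rounds <- matrix(toy_opponents, ncol = 3, byrow = TRUE)
toy_rounds
```

Because every round yields exactly one value, played or not, the flat vector can be cut into fixed-size groups without losing track of which values belong to which player.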

Match_History <- Original[seq(5, 196, 3)]

#This Regular Expression statement matches one of 2 patterns: 1 or 2 consecutive digits that are preceded by "|", one of "W", "L", or "D", and 2 or 3 whitespace characters, and that are followed by "|"; or 4 whitespace characters that are preceded by "|" and one of "U", "H", "B", or "X", and that are followed by "|". The second pattern returns blank strings, which as.numeric() converts to NA.
Opponents <- as.numeric(unlist(str_extract_all(Match_History, "(?<=\\|(W|L|D)\\s{2,3})[[:digit:]]{1,2}(?=\\|)|((?<!->)(?<=\\|(U|H|B|X))\\s{4}(?=\\|))")))
head(Opponents)
## [1] 39 21 18 14  7 12
pcr_matrix <- matrix(data = NA, nrow = 64, ncol = 2)

# Assign readable names for the matrix
colnames(pcr_matrix) <- c("Total Opponent ELO", "Average Opponent ELO")

# Initialize a variable to be used as a counter in the for loop to fill the corresponding matrix row
row_counter <- 0

# Start of for loop
for(i in seq(from=1, to=length(Opponents)-6, by=7)){
  row_counter <- row_counter + 1
  
  # Look up the pre-ratings of this player's opponents by player number (the 7 values starting at position i); NA entries for rounds without an opponent match no player and drop out
  opponent_ratings <- subset(df$Pre_Rating, df$Player_Number %in% Opponents[i:(i + 6)])
  
  # Sum of the opponents' pre-ratings
  pcr_matrix[row_counter, 1] <- sum(opponent_ratings)
  
  # Average opponent ELO over the rounds actually played (na.rm belongs to mean(), not subset())
  pcr_matrix[row_counter, 2] <- mean(opponent_ratings, na.rm = TRUE)
}
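
For comparison, the same totals and averages can be computed without a loop. The sketch below is an alternative under stated assumptions, not the method used in this report: it works on a made-up mini-tournament (4 players, 3 rounds; toy_df and toy_opponents are hypothetical) and reshapes the flat opponent vector into a matrix with one row per player.

```r
# Hypothetical mini-tournament: 4 players, 3 rounds each (values are made up)
toy_df <- data.frame(
  Player_Number = 1:4,
  Pre_Rating    = c(1500, 1600, 1700, 1800)
)

# Opponent numbers in player order, 3 per player; NA marks a round without an opponent
toy_opponents <- c(2, 3, 4,
                   1, NA, 3,
                   4, 1, 2,
                   3, NA, 1)

# One row per player, one column per round
opp_matrix <- matrix(toy_opponents, ncol = 3, byrow = TRUE)

# match() finds each opponent's row in toy_df; NA opponents stay NA
rating_matrix <- matrix(
  toy_df$Pre_Rating[match(opp_matrix, toy_df$Player_Number)],
  nrow = nrow(opp_matrix)
)

# Row-wise sum and mean, ignoring rounds without an opponent
total_elo   <- rowSums(rating_matrix, na.rm = TRUE)
average_elo <- rowMeans(rating_matrix, na.rm = TRUE)

cbind(total_elo, average_elo)
```

One design difference is worth noting: a lookup with match() counts an opponent once per game, whereas a filter with %in% counts each distinct opponent once, so the two approaches would only differ if a player met the same opponent twice.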

Checking Process

This is a quick check that the for loop and the regular expressions produce the desired results without errors.

head(pcr_matrix)
##      Total Opponent ELO Average Opponent ELO
## [1,]              11237             1605.286
## [2,]              10285             1469.286
## [3,]              10945             1563.571
## [4,]              11015             1573.571
## [5,]              10506             1500.857
## [6,]              10631             1518.714
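
As an additional sanity check, the printed totals and averages can be compared arithmetically. The sketch below copies the six rows from the head(pcr_matrix) output (rounded as printed) and recovers how many opponents each average was taken over; the variable names are illustrative only.

```r
# Values copied from the head(pcr_matrix) output, rounded as printed
totals   <- c(11237, 10285, 10945, 11015, 10506, 10631)
averages <- c(1605.286, 1469.286, 1563.571, 1573.571, 1500.857, 1518.714)

# Total divided by average gives the number of opponents counted per player
games_counted <- round(totals / averages)
games_counted

# Recomputing the average from the total should reproduce the printed value
# up to the rounding of the printed output
all(abs(totals / games_counted - averages) < 0.001)
```

For these six players the ratio comes out to 7, so each average covers all seven rounds.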

Combining Data to Desired Results

As seen above, the results so far sit in two separate objects, a data frame and a matrix. Now it is time to combine them into one data frame with all the desired information.

#Adding the Average Opponent ELO (second column of the matrix) to the data frame under a clear column name
df$AVG_ELO <- pcr_matrix[, 2]
head(df)
##   Player_Number                      Player_Name State Pt_Total Pre_Rating
## 1             1 GARY HUA                            ON      6.0       1794
## 2             2 DAKSHESH DARURI                     MI      6.0       1553
## 3             3 ADITYA BAJAJ                        MI      6.0       1384
## 4             4 PATRICK H SCHILLING                 MI      5.5       1716
## 5             5 HANSHI ZUO                          MI      5.5       1655
## 6             6 HANSEN SONG                         OH      5.0       1686
##    AVG_ELO
## 1 1605.286
## 2 1469.286
## 3 1563.571
## 4 1573.571
## 5 1500.857
## 6 1518.714

Turning the Data Frame into a .CSV File

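#row.names = FALSE keeps the row index from being written as an extra unnamed column
write.csv(df, file.path(getwd(), "Chess_Data.csv"), row.names = FALSE)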
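
A quick way to confirm that the exported file has the intended shape is to read it back. The sketch below is a minimal round trip under stated assumptions: it uses two toy rows copied from the head(df) output, a temporary file instead of Chess_Data.csv, and row.names = FALSE so that no row index column is written.

```r
# Two toy rows copied from the head(df) output; names are trimmed by hand
toy_df <- data.frame(
  Player_Name = c("GARY HUA", "DAKSHESH DARURI"),
  State       = c("ON", "MI"),
  Pt_Total    = c(6.0, 6.0),
  Pre_Rating  = c(1794, 1553),
  AVG_ELO     = c(1605.286, 1469.286)
)

# Write to a temporary path (assumption: any writable location works the same way)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_df, csv_path, row.names = FALSE)

# Read the file back and compare its shape with the original
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)
identical(names(round_trip), names(toy_df))
nrow(round_trip) == nrow(toy_df)
```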

Conclusion

Using regular expressions, a for loop, the seq() function, and the tidyverse library, the messy .txt file of chess tournament results was cleaned and transformed into a data frame, which was then written to a .csv file intended for a SQL server/database.