Given a text file of chess tournament results, I will create an R Markdown file that generates a .csv file with the following information for all of the players: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Opponents’ ELO.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The first step is to import the .txt file into RStudio; for further practice, I uploaded the given data set to GitHub. Since the main focus of this project is regular expressions, most if not all of the data is extracted and cleaned with them. read_lines() returns the file as a character vector with one string per line, and each player's record is spread across two of those lines, so each desired column is pulled out with its own regular expression, as shown in the statements that follow.
Original <- read_lines("https://raw.githubusercontent.com/Jlok17/Data-Science-Projects/main/Project%201%20607.txt")
head(Original)
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
#This regular expression matches 1 or 2 consecutive digits that are preceded by 3 or 4 whitespace characters and followed by a single whitespace character.
Player_Number <- as.numeric(unlist(str_extract_all(Original,"(?<=\\s{3,4})\\d{1,2}(?=\\s)")))
#This regular expression matches the text that follows a digit, a whitespace character, a "|", and another whitespace character, and that runs up to the "|" closing the name field. The class is written [A-Za-z, -] because [A-z] would also admit the punctuation characters between "Z" and "a"; str_trim() strips any padding the match captures before the "|".
Player_Name <- str_trim(unlist(str_extract_all(Original,"(?<=\\d\\s\\|\\s)([A-Za-z, -]*\\s){1,}[[:alpha:]]*(?=\\s*\\|)")))
#This regular expression matches 2 consecutive uppercase letters that are followed by a single whitespace character and a "|".
State <- unlist(str_extract_all(Original, "[[:upper:]]{2}(?=\\s\\|)"))
#This regular expression matches a single digit, a period, and another single digit that are preceded by "|".
Pt_Total <- as.numeric(unlist(str_extract_all(Original, "(?<=\\|)\\d\\.\\d")))
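#This regular expression matches one of two patterns: a 3 or 4 digit number that follows "R:" and 1 or 2 whitespace characters and is followed by a whitespace character; or a 3 or 4 digit number that is followed by "P", 1 or 2 digits, optional whitespace, and a hyphen (ratings written with a "P" suffix).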
Pre_Rating <- as.numeric(unlist(str_extract_all(Original, "(?<=R:\\s{1,2})(\\d{3,4}(?=\\s))|(\\d{3,4}(?=P\\d{1,2}\\s*-))")))
#Combining all of my Desired Data into a Data frame except Average ELO of Opponent
df <- data.frame(Player_Number, Player_Name, State, Pt_Total, Pre_Rating)
head(df)
## Player_Number Player_Name State Pt_Total Pre_Rating
## 1 1 GARY HUA ON 6.0 1794
## 2 2 DAKSHESH DARURI MI 6.0 1553
## 3 3 ADITYA BAJAJ MI 6.0 1384
## 4 4 PATRICK H SCHILLING MI 5.5 1716
## 5 5 HANSHI ZUO MI 5.5 1655
## 6 6 HANSEN SONG OH 5.0 1686
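To see how the lookarounds behave on a single record, the sketch below applies patterns of the same shape to a few sample lines. The lines are typed in by hand to resemble the file's fixed-width layout (the exact column widths are an assumption), and names such as `name_line`, `rating_line`, and `provisional_line` are illustrative only.

```r
library(stringr)

# Hand-typed sample record; the column widths are an assumption about the file layout
name_line   <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
rating_line <- "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
# Hypothetical line whose rating carries a "P" suffix
provisional_line <- "   MI | 14150362 / R: 1641P17->1657P24  |N:3  |W    |B    |W    |B    |W    |B    |W    |"

# Digits preceded by 3-4 whitespace characters and followed by whitespace
player_number <- str_extract(name_line, "(?<=\\s{3,4})\\d{1,2}(?=\\s)")

# Name field; str_trim() drops any padding captured before the closing "|"
raw_name <- str_extract(name_line, "(?<=\\d\\s\\|\\s)([A-Za-z, -]*\\s){1,}[[:alpha:]]*(?=\\s*\\|)")
player_name <- str_trim(raw_name)

# Two uppercase letters followed by one whitespace character and "|"
state <- str_extract(rating_line, "[[:upper:]]{2}(?=\\s\\|)")

# digit.digit directly after "|"
points <- str_extract(name_line, "(?<=\\|)\\d\\.\\d")

# Pre-rating, with or without the "P" suffix
rating_pattern <- "(?<=R:\\s{1,2})(\\d{3,4}(?=\\s))|(\\d{3,4}(?=P\\d{1,2}\\s*-))"
pre_rating  <- str_extract(rating_line, rating_pattern)
provisional <- str_extract(provisional_line, rating_pattern)

print(c(player_number, player_name, state, points, pre_rating, provisional))
```

Because lookbehinds and lookaheads only assert context without consuming it, each extracted value comes back without the surrounding pipes or labels, which is why the results can be passed straight to `as.numeric()`.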
In this next step, I use a regular expression to collect the player numbers of the opponents each player faced. Since there are 7 rounds, the extracted values come in groups of 7, one group per player, so entries 1-7 belong to the first player, Gary Hua. Missing opponents have to be accounted for: in chess, a player sometimes has no opponent in a round, or there is a no-show, and this appears in the results as a blank. To do the calculations, I decided it would be best to use a for loop that calculates the sum and mean of the opponents' ELO for each player.
Match_History <- Original[seq(5, 196, 3)]
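#This regular expression matches one of two patterns: 1 or 2 consecutive digits that are preceded by "|", one of "W", "L", or "D", and 2 or 3 whitespace characters, and that are followed by "|"; or 4 whitespace characters that are preceded by "|" and one of "U", "H", "B", or "X", and that are followed by "|". The blank matches become NA under as.numeric(), which keeps every player at exactly 7 entries.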
Opponents <- as.numeric(unlist(str_extract_all(Match_History, "(?<=\\|(W|L|D)\\s{2,3})[[:digit:]]{1,2}(?=\\|)|((?<!->)(?<=\\|(U|H|B|X))\\s{4}(?=\\|))")))
head(Opponents)
## [1] 39 21 18 14 7 12
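As a small illustration of why the blank matches matter, the sketch below runs the same opponent pattern on one hand-typed line that mixes played rounds with rounds that have no opponent. The line and the name `sample_line` are hypothetical, and the column widths are an assumption about the file layout.

```r
library(stringr)

# Hypothetical record: rounds 3, 5 and 7 have no opponent
sample_line <- "   12 | SAMPLE PLAYER                   |4.0  |W  39|L  21|H    |D   7|U    |W  12|B    |"

opponent_pattern <- "(?<=\\|(W|L|D)\\s{2,3})[[:digit:]]{1,2}(?=\\|)|((?<!->)(?<=\\|(U|H|B|X))\\s{4}(?=\\|))"

# Played rounds yield digit strings; rounds without an opponent yield four spaces
raw_matches <- unlist(str_extract_all(sample_line, opponent_pattern))

# Blank strings convert to NA, so the record keeps one slot per round
opponents <- as.numeric(raw_matches)

print(opponents)
print(length(opponents))
```

Keeping a placeholder for every round is what allows the flat vector to be walked in fixed steps of 7; if the blank rounds were dropped instead, later players' opponents would shift into earlier players' groups.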
pcr_matrix <- matrix(data = NA, nrow = 64, ncol = 2)
# Assign readable names for the matrix
colnames(pcr_matrix) <- c("Total Opponent ELO", "Average Opponent ELO")
# Initialize a variable to be used as a counter in the for loop to fill the corresponding matrix row
row_counter <- 0
# Start of for loop
for(i in seq(from=1, to=length(Opponents)-6, by=7)){
  row_counter <- row_counter + 1
  # Look up each opponent's pre-rating by player number and add them up for each row (each row corresponds to one sequence of 7 data points, with the loop value serving as the row 'anchor'); NA opponents match no player and are skipped
  pcr_matrix[row_counter, 1] <- sum(subset(df$Pre_Rating, df$Player_Number %in% Opponents[seq(from=i, to=i+6, by=1)]))
  # Calculate the average ELO for each row, excluding missing entries
  pcr_matrix[row_counter, 2] <- mean(subset(df$Pre_Rating, df$Player_Number %in% Opponents[seq(from=i, to=i+6, by=1)]), na.rm = TRUE)
}
This is a quick check that the for loop and the regular expressions produced the desired results without errors.
head(pcr_matrix)
## Total Opponent ELO Average Opponent ELO
## [1,] 11237 1605.286
## [2,] 10285 1469.286
## [3,] 10945 1563.571
## [4,] 11015 1573.571
## [5,] 10506 1500.857
## [6,] 10631 1518.714
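The same totals and averages can also be computed without an explicit loop. The sketch below is one possible vectorized alternative, not the method used in this report: it reshapes the flat opponent vector into a players-by-rounds matrix and looks the ratings up with `match()`. The toy data (4 players, 3 rounds) and all variable names are hypothetical.

```r
# Toy data (hypothetical): 4 players, 3 rounds; NA marks a round without an opponent
player_number <- 1:4
pre_rating <- c(1500, 1600, 1700, 1800)
opponents <- c(2, 3, 4,
               1, NA, 3,
               4, 1, 2,
               3, NA, 1)
rounds <- 3

# One row per player, one column per round
opp_matrix <- matrix(opponents, ncol = rounds, byrow = TRUE)

# match() maps each opponent number to its position in player_number; NA stays NA
rating_matrix <- matrix(pre_rating[match(opp_matrix, player_number)],
                        nrow = nrow(opp_matrix))

# Row-wise total and mean, ignoring rounds without an opponent
total_opp <- rowSums(rating_matrix, na.rm = TRUE)
avg_opp <- rowMeans(rating_matrix, na.rm = TRUE)

print(cbind(total_opp, avg_opp))
```

One difference worth noting: a lookup with `%in%` counts each distinct opponent once, while this version counts every round. The two agree as long as no pair of players meets twice, which is assumed to hold for this tournament.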
As seen above, the results are held in a separate matrix and data frame, so it is now time to combine them into one data frame with all the desired information.
#Adding the Average Opponent ELO to the dataframe with the desired data
df <- cbind(df, pcr_matrix[,2])
#Renaming the heading of the Average Opponent ELO to have clarity
df <- rename(df, AVG_ELO = 'pcr_matrix[, 2]')
head(df)
## Player_Number Player_Name State Pt_Total Pre_Rating
## 1 1 GARY HUA ON 6.0 1794
## 2 2 DAKSHESH DARURI MI 6.0 1553
## 3 3 ADITYA BAJAJ MI 6.0 1384
## 4 4 PATRICK H SCHILLING MI 5.5 1716
## 5 5 HANSHI ZUO MI 5.5 1655
## 6 6 HANSEN SONG OH 5.0 1686
## AVG_ELO
## 1 1605.286
## 2 1469.286
## 3 1563.571
## 4 1573.571
## 5 1500.857
## 6 1518.714
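#row.names = FALSE omits R's row index, which would otherwise be written as an unnamed first column
write.csv(df, file.path(getwd(), "Chess_Data.csv"), row.names = FALSE)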
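Before loading the file into a database, it can help to read it back and confirm the columns survived the export. The sketch below does this with a two-row data frame and a temporary file; `toy`, `csv_path`, and `round_trip` are hypothetical names, and the values are copied from the first two rows shown above.

```r
# Two-row stand-in for the final data frame
toy <- data.frame(
  Player_Name = c("GARY HUA", "DAKSHESH DARURI"),
  State = c("ON", "MI"),
  Pt_Total = c(6.0, 6.0),
  Pre_Rating = c(1794, 1553),
  AVG_ELO = c(1605.286, 1469.286)
)

# Write to a temporary file so the working directory is left untouched
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the file back and compare its structure with the original
round_trip <- read.csv(csv_path)
print(names(round_trip))
print(nrow(round_trip))
```

Passing `row.names = FALSE` keeps `write.csv()` from adding an index column, so the header row lines up one-to-one with the data frame's column names.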
Through the use of regular expressions, a for loop, the seq() function, and the tidyverse library, the messy .txt chess tournament results were cleaned and transformed into a data frame, which was ultimately written to a .csv file intended for a SQL server/database.