SPS_DATA607_Week4_DC
Project 1
In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents
For the first player, the information would be:
Gary Hua, ON, 6.0, 1794, 1605
1605 was calculated by summing the pre-tournament ratings of the seven opponents (1436, 1563, 1600, 1610, 1649, 1663, 1716) and dividing by the total number of games played.
If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data science, like chess, is a game of back and forth…
The chess rating system (invented by a Marquette University physics professor named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments.
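The 1605 figure for the first player can be checked with a few lines of R. This is only a worked example of the arithmetic described above; the variable names are illustrative choices, not part of the assignment.

```r
# Pre-tournament ratings of the first player's seven opponents (from the text above)
opponent_ratings <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

# Average = sum of ratings / number of games played
avg_rating <- sum(opponent_ratings) / length(opponent_ratings)

# Round to the nearest whole rating point
rounded_avg <- round(avg_rating)
print(rounded_avg)  # 1605
```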
Approach
Upload the text file to GitHub and load it into R, then examine the file structure. Each row appears to have a fixed number of characters, similar to a COBOL printout.
This is also a chance for me to review string operations in R. The first two rows are headers, and each record spans two lines of text. Fields are separated by the “|” pipe character, and records are separated by lines of “-” characters.
Converting this to a data frame produces many columns (20 here: 10 fields from each of a record's two lines).
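As a minimal sketch of that structure, the two lines of one record can be split on the pipe character and joined into a single vector of fields. The sample lines are reconstructed from the values shown in this report's output, so their exact spacing is an assumption, and split_fields is a hypothetical helper name.

```r
# Two lines of one record; values come from this report, column spacing is assumed.
line1 <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
line2 <- "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

# Hypothetical helper: split one line on "|" and trim surrounding whitespace.
# strsplit() drops the empty piece after a trailing delimiter.
split_fields <- function(x) trimws(strsplit(x, "|", fixed = TRUE)[[1]])

# Combine both lines into one record
record <- c(split_fields(line1), split_fields(line2))

print(length(record))          # number of fields in the combined record
print(record[c(1, 2, 3, 11)])  # pair number, name, total points, state
```

Splitting on a fixed delimiter is enough here because every field is bounded by pipes; fixed-width parsing would be an alternative if the delimiters were missing.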
Running Code
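Load the readr library (install it first if needed). The file itself is read below with base R's readLines(), so readr is not strictly required.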
# Load required library
library(readr)
# GitHub raw file URL
url <- "https://raw.githubusercontent.com/dyc-sps/SPS_Data607_Week4/refs/heads/main/tournamentinfo.txt"
# Read all lines
lines <- readLines(url, warn=FALSE)
# The first 3 lines form the header block (one separator line plus two header rows)
header_lines <- lines[1:3]
#print(header_lines)
data_lines <- lines[-c(1,2,3)]
#print(data_lines)
# Remove separator lines (lines that contain only "-")
#data_lines <- data_lines[data_lines != "-"]
data_lines <- data_lines[!grepl("^[-]+$", data_lines)]
header_lines <- header_lines[!grepl("^[-]+$", header_lines)]
#print(data_lines)
# Number of records (2 lines per record)
n <- length(data_lines) / 2
# Initialize list to store records
records <- list()
# Process each record
for (i in seq_len(n)) {
  line1 <- strsplit(data_lines[2*i - 1], "\\|")[[1]] |> trimws()
  line2 <- strsplit(data_lines[2*i], "\\|")[[1]] |> trimws()
  # Combine both lines into a single record
  record <- c(line1, line2)
  records[[i]] <- record
}
# Convert to data frame
df <- do.call(rbind, records)
df <- as.data.frame(df, stringsAsFactors = FALSE)
# Use header lines for column names
print(header_lines[1])
[1] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
header1 <- strsplit(header_lines[1], "\\|")[[1]] |> trimws()
header1 <- header1[-length(header1)]
#print(header1)
header2 <- strsplit(header_lines[2], "\\|")[[1]] |> trimws()
header2 <- header2[-length(header2)]
#print(header2)
colnames(df) <- c(header1, header2)
# Save as CSV
write.csv(df, "tournamentinfo.csv", row.names = FALSE)
# View first few rows
head(df)
  Pair         Player Name Total Round Round Round Round Round Round Round Num
1    1            GARY HUA   6.0  W 39  W 21  W 18  W 14   W 7  D 12   D 4  ON
2    2     DAKSHESH DARURI   6.0  W 63  W 58   L 4  W 17  W 16  W 20   W 7  MI
3    3        ADITYA BAJAJ   6.0   L 8  W 61  W 25  W 21  W 11  W 13  W 12  MI
4    4 PATRICK H SCHILLING   5.5  W 23  D 28   W 2  W 26   D 5  W 19   D 1  MI
5    5          HANSHI ZUO   5.5  W 45  W 37  D 12  D 13   D 4  W 14  W 17  MI
6    6         HANSEN SONG   5.0  W 34  D 29  L 11  W 35  D 10  W 27  W 21  OH
  USCF ID / Rtg (Pre->Post) Pts 1 2 3 4 5 6 7
1 15445895 / R: 1794 ->1817 N:2 W B W B W B W
2 14598900 / R: 1553 ->1663 N:2 B W B W B W B
3 14959604 / R: 1384 ->1640 N:2 W B W B W B W
4 12616049 / R: 1716 ->1744 N:2 W B W B W B B
5 14601533 / R: 1655 ->1690 N:2 B W B W B W B
6 15055204 / R: 1686 ->1687 N:3 W B W B B W B
Load the stringr library to process strings.
library(stringr)
df$R_number <- as.numeric(str_match(df[,12], "R:\\s*(\\d+)")[,2])
head(df)
  Pair         Player Name Total Round Round Round Round Round Round Round Num
1    1            GARY HUA   6.0  W 39  W 21  W 18  W 14   W 7  D 12   D 4  ON
2    2     DAKSHESH DARURI   6.0  W 63  W 58   L 4  W 17  W 16  W 20   W 7  MI
3    3        ADITYA BAJAJ   6.0   L 8  W 61  W 25  W 21  W 11  W 13  W 12  MI
4    4 PATRICK H SCHILLING   5.5  W 23  D 28   W 2  W 26   D 5  W 19   D 1  MI
5    5          HANSHI ZUO   5.5  W 45  W 37  D 12  D 13   D 4  W 14  W 17  MI
6    6         HANSEN SONG   5.0  W 34  D 29  L 11  W 35  D 10  W 27  W 21  OH
  USCF ID / Rtg (Pre->Post) Pts 1 2 3 4 5 6 7 R_number
1 15445895 / R: 1794 ->1817 N:2 W B W B W B W     1794
2 14598900 / R: 1553 ->1663 N:2 B W B W B W B     1553
3 14959604 / R: 1384 ->1640 N:2 W B W B W B W     1384
4 12616049 / R: 1716 ->1744 N:2 W B W B W B B     1716
5 14601533 / R: 1655 ->1690 N:2 B W B W B W B     1655
6 15055204 / R: 1686 ->1687 N:3 W B W B B W B     1686
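How the pattern "R:\\s*(\\d+)" and the trailing [,2] work:
• R: → matches the literal text “R:” that precedes the pre-rating, so everything before it is ignored
• \\s* → allows optional spaces
• (\\d+) → captures the digits of the pre-rating
• [,2] → selects the second column of the str_match() result, which holds the captured digits (the first column holds the full match, including “R:”)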
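The pieces of that pattern can be seen on a small input. This is a sketch rather than part of the pipeline: the first string is the rating field shown above for the first player, while the second is a hypothetical entry (made-up ID and ratings) included only to show that matching stops at the first non-digit.

```r
library(stringr)

rating_field <- c(
  "15445895 / R: 1794   ->1817",     # first player's field, as shown in this report
  "00000000 / R: 1641P17->1657P24"   # hypothetical entry with a suffix after the digits
)

# str_match() returns a character matrix: column 1 is the full match,
# column 2 is the first capture group.
m <- str_match(rating_field, "R:\\s*(\\d+)")

full_match <- m[, 1]
pre_rating <- as.numeric(m[, 2])

print(full_match)
print(pre_rating)
```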
df$avg_pre_r <- NA
for (row_num in seq_len(nrow(df))) {
  pre_r_num <- 0  # running sum of the opponents' pre-ratings
  na_num <- 0     # rounds with no opponent number (no game played)
  for (col in 4:10) {
    # Columns 4 to 10 hold the round results (e.g. "W 39");
    # drop the result letter and keep the opponent's pair number.
    pair_num <- str_match(df[row_num, col], " \\s*(\\d+)")[, 2]
    # Rows are in pair-number order, so the pair number is also the row index.
    opp_rating <- df$R_number[as.integer(pair_num)]
    if (is.na(opp_rating)) {
      na_num <- na_num + 1
    } else {
      pre_r_num <- pre_r_num + opp_rating
    }
  }
  played_num <- 7 - na_num  # games actually played
  df[row_num, "avg_pre_r"] <- round(pre_r_num / played_num, 0)
}
With the calculations finished, copy the required columns into a new data frame and export it to a CSV file.
newdf<-df[,c(1,2,11,3,21,22)]
colnames(newdf)<- c("Pair_num","Player_Name","Player_State","Total_Pts","Pre_Rating","Average_Pre_Rating")
head(newdf)
  Pair_num         Player_Name Player_State Total_Pts Pre_Rating
1        1            GARY HUA           ON       6.0       1794
2        2     DAKSHESH DARURI           MI       6.0       1553
3        3        ADITYA BAJAJ           MI       6.0       1384
4        4 PATRICK H SCHILLING           MI       5.5       1716
5        5          HANSHI ZUO           MI       5.5       1655
6        6         HANSEN SONG           OH       5.0       1686
  Average_Pre_Rating
1               1605
2               1469
3               1564
4               1574
5               1501
6               1519
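The averaging step can also be written without explicit loops. The sketch below is an alternative to the loop used in this report, not the method that produced the results above. It runs on a made-up three-player table (toy, Round1 and Round2 are hypothetical names) so the result is easy to verify by hand.

```r
# Toy data: three hypothetical players, two rounds (not the tournament data).
# Round 1: player 1 beats player 2, player 3 has a bye ("H").
# Round 2: player 2 beats player 3, player 1 has a bye ("B").
toy <- data.frame(
  Pair     = 1:3,
  R_number = c(1500, 1620, 1700),
  Round1   = c("W 2", "L 1", "H"),
  Round2   = c("B", "W 3", "L 2"),
  stringsAsFactors = FALSE
)
round_cols <- c("Round1", "Round2")

# Keep only the digits of each result; entries without an opponent become NA.
opp <- sapply(toy[round_cols], function(x) as.integer(gsub("\\D", "", x)))

# Look up each opponent's pre-rating by pair number (NA stays NA),
# then restore the players-by-rounds shape.
opp_rating <- matrix(toy$R_number[opp], nrow = nrow(opp))

# Average over the games actually played, ignoring NA rounds.
avg_pre_r <- round(rowMeans(opp_rating, na.rm = TRUE))
print(avg_pre_r)
```

Player 2 faced players 1 and 3, so the expected average is (1500 + 1700) / 2 = 1600; players 1 and 3 each faced only player 2, giving 1620.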
path <- getwd()  # write to the current working directory
# row.names = FALSE keeps R's row index out of the exported file
write.csv(newdf, file.path(path, "project1.csv"), row.names = FALSE)
Conclusion
This is similar to string processing in Python, though the details differ and the R library documentation (here, stringr) needs to be checked. Using ChatGPT for this type of task can save significant time when trying things out and fixing issues.
LLMs used:
• OpenAI. (2025). ChatGPT (Version 5.2) [Large language model]. https://chat.openai.com. Accessed Feb 22, 2026.