library(tidyverse)
library(stringr)
DATA 607 Project 1 - Chess Tournament Analysis
Understanding the file first
The first thing I will do is look at the text file and try to understand what I am working with. This is not a CSV or Excel file. It is just a printed chess tournament table. After watching the chess crosstable video, I understand that:
1. Each player shows up in two lines
2. The first line has the player name, total points, and round results
3. The second line has the state and rating information
So the file looks messy, but the structure repeats for every player.
Reading the data into R
Next, I will read the text file line by line using R. I will not try to read it as a table, because it is not formatted that way. Reading line by line makes it easier to pull out the parts I need.
Finding where player data starts
Then, I will separate the important lines from the file. I will:
1. Grab the lines that start with a number → these are player lines
2. Grab the lines that start with a state code like MI or ON → these are rating lines
This helps me line up each player with their state and rating.
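As a small illustration of this filtering step, the sketch below classifies a few made-up lines. The sample lines (including the player JANE DOE and all of her numbers) are hypothetical and only imitate the layout described above; the dashed separator rows are likewise an assumption about how the printed table is drawn.

```r
library(stringr)

# Hypothetical lines imitating the tournament file layout (not real data)
sample_lines <- c(
  "-----------------------------------------",
  "    1 | JANE DOE                        |4.5  |W  12|D   7|",
  "   MI | 12345678 / R: 1500   ->1520     |N:2  |W    |B    |",
  "-----------------------------------------"
)

# Player lines: optional spaces, a player number, then " |"
is_player <- str_detect(sample_lines, "^\\s*\\d+ \\|")
# Rating lines: optional spaces, a two-letter state code, then " |"
is_rating <- str_detect(sample_lines, "^\\s*[A-Z]{2} \\|")

player_lines <- sample_lines[is_player]
rating_lines <- sample_lines[is_rating]
```

Because both patterns are anchored at the start of the line, separator rows and any header text fall into neither group and are simply dropped.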
Extracting basic player information
After that, I will extract the basic things I need for each player:
- Player name
- Player state
- Total points
- Pre-tournament rating
I will ignore the post-rating because the project only asks for pre-rating.
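To make the pre-rating rule concrete, here is a minimal sketch on made-up rating lines. The three strings are hypothetical, and the details they imitate (a "P" suffix on provisional ratings, and an extra space in front of three-digit ratings) are assumptions about the file layout rather than something stated above.

```r
library(stringr)

# Hypothetical rating lines (not real data): a plain rating, a rating with a
# "P" suffix, and a three-digit rating padded with an extra space
rating_lines <- c(
  "   MI | 12345678 / R: 1500   ->1520     |N:2  |",
  "   OH | 23456789 / R: 1400P10->1450P17  |N:3  |",
  "   MI | 34567890 / R:  850P5 -> 900P12  |N:4  |"
)

# Capture the first run of digits after "R:", allowing any amount of
# whitespace; the suffix and the post-rating after "->" are never captured
pre_rating <- as.numeric(str_match(rating_lines, "R:\\s*(\\d+)")[, 2])
pre_rating
```

Using a capture group with `\\s*` rather than a fixed-width lookbehind keeps the pattern from silently returning NA when the padding changes.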
Getting and matching opponent information from rounds
Next, I will look at the round results. Each round shows the opponent’s player number, not their name or rating. So I will pull out the opponent numbers from each round and store them for each player.
Then, I will create a lookup list that connects: Player number → Pre-rating
Using this, I will:
1. Match each opponent number to their pre-rating
2. Collect all opponent ratings for each player
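The lookup idea can be sketched on a toy table. Everything in this block (the four players, their ratings, and the opponent numbers) is hypothetical and only illustrates how `match()` turns player numbers into ratings.

```r
# Hypothetical lookup table (player number -> pre-rating); not real data
rating_lookup <- data.frame(
  player_num = c(1L, 2L, 3L, 4L),
  pre_rating = c(1500, 1400, 1650, 1550)
)

# Opponent numbers faced by one hypothetical player
opp_nums <- c(3L, 1L, 4L)

# match() returns the row position of each opponent in the lookup table,
# which is then used to index the pre-rating column
opp_ratings <- rating_lookup$pre_rating[match(opp_nums, rating_lookup$player_num)]
opp_ratings
```

Indexing through `match()` keeps the ratings in the same order as the opponent numbers, and an opponent number missing from the table would show up as NA instead of being skipped unnoticed.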
Calculating average opponent rating
After I have all opponent ratings, I will:
1. Ignore games that were not played
2. Take the average of the remaining ratings
3. Round the final average to a whole number
This gives me the average pre-tournament rating of each player's opponents, which is required for the project.
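A short worked example of the arithmetic, using made-up numbers: a hypothetical player has five played games, so the average is taken over five ratings, not over the number of rounds.

```r
# Hypothetical player (not real data): two rounds were unplayed, so only
# five opponent ratings were looked up
opp_ratings <- c(1650, 1500, 1550, 1420, 1387)

# Average over the games actually played: 7507 / 5 = 1501.4
avg_opp <- round(mean(opp_ratings))
avg_opp
```

Dividing by the number of games played, not the number of rounds, is what keeps byes and forfeits from dragging the average down.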
Creating the final dataset
Finally, I will put everything together into one clean dataset with:
- Player Name
- Player State
- Total Points
- Pre-Rating
- Average Opponent Pre-Rating
I will then export this as a CSV file so it can be used in SQL or other systems.
Code Base
Reading Text File
file_url <- "https://raw.githubusercontent.com/sinemkilicdere/Data607/refs/heads/main/Project1/tournamentinfo.txt"
raw_lines <- readLines(file_url)
Warning in readLines(file_url): incomplete final line found on
'https://raw.githubusercontent.com/sinemkilicdere/Data607/refs/heads/main/Project1/tournamentinfo.txt'
I use readLines() because this is a text file, not a CSV.
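The "incomplete final line" warning only means the file does not end with a newline character; every line is still read. If the message is unwanted, one option is the `warn` argument of `readLines()`. The sketch below demonstrates this on a temporary file, which is a stand-in for the real URL.

```r
# Write a tiny temporary file whose last line has no trailing newline,
# which is the condition that triggers the readLines() warning
tmp <- tempfile(fileext = ".txt")
cat("first line\nsecond line", file = tmp)

# warn = FALSE suppresses the "incomplete final line" warning;
# the content returned is the same either way
lines <- readLines(tmp, warn = FALSE)
lines
unlink(tmp)
```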
Identifying player and rating lines
player_lines <- raw_lines[str_detect(raw_lines, "^\\s*\\d+ \\|")]
state_lines <- raw_lines[str_detect(raw_lines, "^\\s*[A-Z]{2} \\|")]
Player lines start with a number. State lines start with two capital letters like MI or ON.
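Since each player is later paired with a rating line by position, it may be worth checking that the two vectors have the same length before going further. The helper below, `check_pairing()`, is a hypothetical addition and not part of the original analysis; the toy inputs stand in for the real filtered lines.

```r
# Hypothetical helper: stop early if the two filtered vectors cannot be
# paired one-to-one by position
check_pairing <- function(player_lines, state_lines) {
  if (length(player_lines) != length(state_lines)) {
    stop("Found ", length(player_lines), " player lines but ",
         length(state_lines), " rating lines")
  }
  invisible(TRUE)
}

# Toy inputs: a matching pair of vectors, then a mismatched pair
ok <- check_pairing(c("    1 | A |", "    2 | B |"),
                    c("   MI | x |", "   ON | y |"))
bad <- tryCatch(check_pairing(c("    1 | A |"), character(0)),
                error = function(e) conditionMessage(e))
```

In the analysis itself the call would be `check_pairing(player_lines, state_lines)`.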
Extracting player number, name, and total points
players <- tibble(
  # Leading number on the line
  player_num = as.integer(str_extract(player_lines, "^\\s*\\d+")),
  # Text between the first and second pipes
  name = str_trim(str_extract(player_lines, "(?<=\\|).+?(?=\\|)")),
  # Decimal score (such as 6.0) sitting between two pipes
  total_points = as.numeric(str_extract(player_lines, "(?<=\\|)\\d+\\.\\d(?=\\s*\\|)"))
)
Extracting player state and pre-rating
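# Two-letter state code at the start of each rating line
states <- str_extract(state_lines, "(?<=^\\s{3})[A-Z]{2}")
# Pre-rating: the digits after "R:". Ratings below 1000 are padded with an
# extra space, so allow any amount of whitespace; requiring exactly one
# whitespace character returns NA for those players.
ratings <- as.numeric(str_match(state_lines, "R:\\s*(\\d+)")[, 2])
players <- players %>%
  mutate(
    state = states,
    pre_rating = ratings
  )
Extracting opponent numbers from each round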
get_opponents <- function(line) {
  parts <- str_split(line, "\\|")[[1]]
  round_parts <- parts[4:length(parts)]
  # Extract numbers from each round column
  opp_list <- str_extract_all(round_parts, "\\d+")
  # Flatten list into single numeric vector
  opp_nums <- as.integer(unlist(opp_list))
  opp_nums
}
opponents_list <- lapply(player_lines, get_opponents)
Each round lists the opponent’s player number.
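To see what the function returns when a round was not played, here is a minimal sketch on one made-up line. The player line is hypothetical, and the use of the letter H for an unplayed round is an assumption about the file's result codes; the function body restates get_opponents() from above so the sketch runs on its own.

```r
library(stringr)

# Same logic as get_opponents() above, restated so the sketch is self-contained
get_opponents <- function(line) {
  parts <- str_split(line, "\\|")[[1]]
  round_parts <- parts[4:length(parts)]
  as.integer(unlist(str_extract_all(round_parts, "\\d+")))
}

# Hypothetical player line (not real data): a win, an unplayed round, a loss
sample_line <- "    5 | JANE DOE                        |1.5  |W  12|H    |L   3|"
opps <- get_opponents(sample_line)
opps
```

The unplayed round contains no digits, so it contributes nothing to the vector. This is how games that were not played end up excluded from the average.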
Creating a rating lookup table
rating_lookup <- players %>%
  select(player_num, pre_rating)
Calculate average opponent pre-rating
avg_opponent_rating <- sapply(opponents_list, function(opp_nums) {
  # Look up each opponent's pre-rating by player number
  opp_ratings <- rating_lookup$pre_rating[
    match(opp_nums, rating_lookup$player_num)
  ]
  mean(opp_ratings, na.rm = TRUE)
})
players$avg_opponent_rating <- round(avg_opponent_rating)
Final clean dataset
final_df <- players %>%
  select(
    Player_Name = name,
    Player_State = state,
    Total_Points = total_points,
    Pre_Rating = pre_rating,
    Avg_Opponent_Pre_Rating = avg_opponent_rating
  )
final_df
# A tibble: 64 × 5
   Player_Name       Player_State Total_Points Pre_Rating Avg_Opponent_Pre_Rat…¹
   <chr>             <chr>               <dbl>      <dbl>                  <dbl>
 1 GARY HUA          ON                    6         1794                   1605
 2 DAKSHESH DARURI   MI                    6         1553                   1469
 3 ADITYA BAJAJ      MI                    6         1384                   1564
 4 PATRICK H SCHILL… MI                    5.5       1716                   1574
 5 HANSHI ZUO        MI                    5.5       1655                   1501
 6 HANSEN SONG       OH                    5         1686                   1519
 7 GARY DEE SWATHELL MI                    5         1649                   1372
 8 EZEKIEL HOUGHTON  MI                    5         1641                   1468
 9 STEFANO LEE       ON                    5         1411                   1523
10 ANVIT RAO         MI                    5         1365                   1554
# ℹ 54 more rows
# ℹ abbreviated name: ¹Avg_Opponent_Pre_Rating
write_csv(final_df, "chess_tournament_clean.csv")
Sources
- Chess Crosstable Explanation Video https://youtu.be/T5PXYl2FEUo
- FiveThirtyEight Elo Ratings https://fivethirtyeight.com/tag/elo-ratings/