Chess Text File

Introduction

The objective is to import a text file containing the results of a chess tournament, extract the information needed, and export to a CSV file that could be used in a database.

A preview of the text file being used

Data Dictionary

Column	Description
player_name	player’s full name
player_state	player’s state
total_pts	total points
pre-rating	player’s pre-rating entering the tournament
avg_opp_rating	average pre-rating of opponents entering the tournament

Required Libraries

library(tidyverse)
library(stringr)
library(ggrepel)

Read Text File

This will import the text file, reading each line individually and appending it to a DataFrame.

df <- data.frame()

con = file('data.txt', "r")
while (TRUE) {
  line = readLines(con, n = 1)
  
  if (length(line) == 0) {
     break
  }
  
  df <- bind_rows(df, data.frame(line))
}
close(con)

The output of the DataFrame looks like this.

line
—————————————————————————————–
Pair \| Player Name \|Total\|Round\|Round\|Round\|Round\|Round\|Round\|Round\|
Num \| USCF ID / Rtg (Pre->Post) \| Pts \| 1 \| 2 \| 3 \| 4 \| 5 \| 6 \| 7 \|
—————————————————————————————–
1 \| GARY HUA \|6.0 \|W 39\|W 21\|W 18\|W 14\|W 7\|D 12\|D 4\|
ON \| 15445895 / R: 1794 ->1817 \|N:2 \|W \|B \|W \|B \|W \|B \|W \|

Drop Non-Alphanumeric Rows

The next step is to remove rows in the DataFrame that does not provide any useful information at all, such as rows only containing “—”. Using str_detect(), I looked for rows containing at least some form of alphanumeric values within it.

# find rows that has alpha-numeric characters
for (row in df){
  df$str_alpha <- str_detect(row, '[a-zA-Z0-9]')
}

We can see which rows were alphanumeric and which ones were not

line	str_alpha
—————————————————————————————–	FALSE
Pair \| Player Name \|Total\|Round\|Round\|Round\|Round\|Round\|Round\|Round\|	TRUE
Num \| USCF ID / Rtg (Pre->Post) \| Pts \| 1 \| 2 \| 3 \| 4 \| 5 \| 6 \| 7 \|	TRUE
—————————————————————————————–	FALSE
1 \| GARY HUA \|6.0 \|W 39\|W 21\|W 18\|W 14\|W 7\|D 12\|D 4\|	TRUE
ON \| 15445895 / R: 1794 ->1817 \|N:2 \|W \|B \|W \|B \|W \|B \|W \|	TRUE

Now lets drop these rows and the header rows we do not need

# drop non-alpha-numeric rows
df_alpha_num <- df %>% 
  filter(., str_alpha == TRUE) %>%
  select(., line)

# drop first two rows containing header info
df_alpha_num <- df_alpha_num %>%
  slice(-c(1, 2))

Here is the output contaning only the player information

line
1 \| GARY HUA \|6.0 \|W 39\|W 21\|W 18\|W 14\|W 7\|D 12\|D 4\|
ON \| 15445895 / R: 1794 ->1817 \|N:2 \|W \|B \|W \|B \|W \|B \|W \|
2 \| DAKSHESH DARURI \|6.0 \|W 63\|W 58\|L 4\|W 17\|W 16\|W 20\|W 7\|
MI \| 14598900 / R: 1553 ->1663 \|N:2 \|B \|W \|B \|W \|B \|W \|B \|
3 \| ADITYA BAJAJ \|6.0 \|L 8\|W 61\|W 25\|W 21\|W 11\|W 13\|W 12\|
MI \| 14959604 / R: 1384 ->1640 \|N:2 \|W \|B \|W \|B \|W \|B \|W \|

Extract Player Data

Every two rows contains a individual’s tournament information. The DataFrame will be parsed between “|” characters and added into individual columns for each player. The columns of interest will then be selected and placed into a new DataFrame.

# create empty dataframe
player_df <- data.frame()

# loop through each row 
for (i in seq(1, nrow(df_alpha_num), by = 2)){
  
  # read every odd row for player id, name and total points
  row_info <- str_split_fixed(df_alpha_num[i, ],regex("|", literal=TRUE),n=Inf)
  df_temp <- data.frame(row_info)
  temp_cols <- c("player_id", "player_name", "total_pts", "round_1", "round_2", "round_3", "round_4", "round_5", "round_6", "round_7")
  colnames(df_temp) = temp_cols
  df_export <- df_temp %>% 
      select(1:10)
  
  # read every even row to add additional columns for player info state and pre-rating
  row_info <- str_split_fixed(df_alpha_num[i+1, ],regex("|", literal=TRUE),n=Inf)
  df_temp <- data.frame(row_info)
  temp_cols <- c("player_state", "pre_rating")
  colnames(df_temp) = temp_cols
  df_export <- bind_cols(df_export, df_temp %>% 
      select(1:2))
  
  # extract individual player info to all players dataframe
  player_df <- bind_rows(player_df, df_export)
}

Here we can see an example of how the information was separated

player_state	pre_rating	NA	NA	NA	NA	NA	NA	NA	NA	NA
MI	15006561 / R: 1163 ->1112		B	W	W	B	W	B	B

Finally, we get a DataFrame that begins to look more manageable to work with

player_id	player_name	total_pts	round_1	round_2	round_3	round_4	round_5	round_6	round_7	player_state	pre_rating
1	GARY HUA	6.0	W 39	W 21	W 18	W 14	W 7	D 12	D 4	ON	15445895 / R: 1794 ->1817
2	DAKSHESH DARURI	6.0	W 63	W 58	L 4	W 17	W 16	W 20	W 7	MI	14598900 / R: 1553 ->1663
3	ADITYA BAJAJ	6.0	L 8	W 61	W 25	W 21	W 11	W 13	W 12	MI	14959604 / R: 1384 ->1640
4	PATRICK H SCHILLING	5.5	W 23	D 28	W 2	W 26	D 5	W 19	D 1	MI	12616049 / R: 1716 ->1744
5	HANSHI ZUO	5.5	W 45	W 37	D 12	D 13	D 4	W 14	W 17	MI	14601533 / R: 1655 ->1690
6	HANSEN SONG	5.0	W 34	D 29	L 11	W 35	D 10	W 27	W 21	OH	15055204 / R: 1686 ->1687

Clean Strings

We can see that some of the columns need to be further cleaned because they contain more information than needed. The next step was removing the decision such as “W” for win, and solely keeping the numeric value as it tells us the player ID of the person they played that round.

for (i in 4:10){
  player_df[, i] <- gsub("[^0-9]", "", player_df[, i])
}

Next, the player’s pre-rating is right after the “:” but before “<-”. For example, in “14959604 / R: 1384 ->1640”, 1384 would be the player’s pre-rating. The number will be extracted using parse_number()

player_df <- player_df %>% 
  separate_wider_delim(pre_rating, delim = ": ", names = c('rtg_1', 'rtg_2'))

player_df$pre_rating <- player_df$rtg_2 %>% parse_number()

This will remove preceding and following whitespaces in a string throughout the DataFrame.

player_df <- player_df %>% 
  mutate(across(where(is.character), str_trim))

Finally, the columns data types were change where appropriately needed.

change_type <- c("player_id", "total_pts", "round_1", "round_2", "round_3", "round_4", "round_5", "round_6", "round_7", "pre_rating")
player_df <- player_df %>% 
  mutate_at(change_type, as.numeric)

The DataFrame now looks like this.

player_id	player_name	total_pts	round_1	round_2	round_3	round_4	round_5	round_6	round_7	player_state	rtg_1	rtg_2	pre_rating
1	GARY HUA	6.0	39	21	18	14	7	12	4	ON	15445895 / R	1794 ->1817	1794
2	DAKSHESH DARURI	6.0	63	58	4	17	16	20	7	MI	14598900 / R	1553 ->1663	1553
3	ADITYA BAJAJ	6.0	8	61	25	21	11	13	12	MI	14959604 / R	1384 ->1640	1384
4	PATRICK H SCHILLING	5.5	23	28	2	26	5	19	1	MI	12616049 / R	1716 ->1744	1716
5	HANSHI ZUO	5.5	45	37	12	13	4	14	17	MI	14601533 / R	1655 ->1690	1655
6	HANSEN SONG	5.0	34	29	11	35	10	27	21	OH	15055204 / R	1686 ->1687	1686

Determine Opponents Average Pre-Ratings

As mentioned before, each round column contains the opponent they played. Now, that opponent’s player ID will be replaced with the opponent’s pre-rating instead. This is so we can calculate the opponent’s average pre-rating for the player.

for (i in 1:nrow(player_df)){
  for (j in 4:10){
    if (!is.na(player_df[[i,j]])){
      player_df[i,j] <- player_df %>% 
        filter(., player_id == player_df[[i, j]]) %>% 
        select(pre_rating)
    }
  }
}

player_id	player_name	total_pts	round_1	round_2	round_3	round_4	round_5	round_6	round_7	player_state	rtg_1	rtg_2	pre_rating
1	GARY HUA	6.0	1436	1563	1600	1610	1649	1663	1716	ON	15445895 / R	1794 ->1817	1794
2	DAKSHESH DARURI	6.0	1175	917	1716	1629	1604	1595	1649	MI	14598900 / R	1553 ->1663	1553
3	ADITYA BAJAJ	6.0	1641	955	1745	1563	1712	1666	1663	MI	14959604 / R	1384 ->1640	1384
4	PATRICK H SCHILLING	5.5	1363	1507	1553	1579	1655	1564	1794	MI	12616049 / R	1716 ->1744	1716
5	HANSHI ZUO	5.5	1242	980	1663	1666	1716	1610	1629	MI	14601533 / R	1655 ->1690	1655
6	HANSEN SONG	5.0	1399	1602	1712	1438	1365	1552	1563	OH	15055204 / R	1686 ->1687	1686

Here, we finally calculate the average opponent pre-rating

player_df <- player_df %>% 
  mutate(avg_opp_pre_rating = round(rowMeans(pick(round_1:round_7), na.rm=TRUE), 2))

player_id	player_name	total_pts	round_1	round_2	round_3	round_4	round_5	round_6	round_7	player_state	rtg_1	rtg_2	pre_rating	avg_opp_pre_rating
1	GARY HUA	6.0	1436	1563	1600	1610	1649	1663	1716	ON	15445895 / R	1794 ->1817	1794	1605.29
2	DAKSHESH DARURI	6.0	1175	917	1716	1629	1604	1595	1649	MI	14598900 / R	1553 ->1663	1553	1469.29
3	ADITYA BAJAJ	6.0	1641	955	1745	1563	1712	1666	1663	MI	14959604 / R	1384 ->1640	1384	1563.57
4	PATRICK H SCHILLING	5.5	1363	1507	1553	1579	1655	1564	1794	MI	12616049 / R	1716 ->1744	1716	1573.57
5	HANSHI ZUO	5.5	1242	980	1663	1666	1716	1610	1629	MI	14601533 / R	1655 ->1690	1655	1500.86
6	HANSEN SONG	5.0	1399	1602	1712	1438	1365	1552	1563	OH	15055204 / R	1686 ->1687	1686	1518.71

Export to CSV

Now that we have the data we want, we can export the columns of interest to CSV.

export_df <- player_df %>% 
  select(-c(1, 4:10, 12, 13)) %>% 
  select(c(1, 3, 2, 4, 5)) 

write_csv(export_df, 'chess_players.csv')

player_name	player_state	total_pts	pre_rating	avg_opp_pre_rating
GARY HUA	ON	6.0	1794	1605.29
DAKSHESH DARURI	MI	6.0	1553	1469.29
ADITYA BAJAJ	MI	6.0	1384	1563.57
PATRICK H SCHILLING	MI	5.5	1716	1573.57
HANSHI ZUO	MI	5.5	1655	1500.86
HANSEN SONG	OH	5.0	1686	1518.71

Conclusion

Calculating the pre-rating difference between the player and average player rating, we can see some players had a tougher road to reach the top in total points earned. Aditya Bajaj had the largest difference of almost -180 pre-rating to their average opponent, yet collected 6.0 total points.

player_name	player_state	total_pts	rating_diff
ADITYA BAJAJ	MI	6.0	-179.57
DAKSHESH DARURI	MI	6.0	83.71
GARY HUA	ON	6.0	188.71
PATRICK H SCHILLING	MI	5.5	142.43
HANSHI ZUO	MI	5.5	154.14
STEFANO LEE	ON	5.0	-112.14
HANSEN SONG	OH	5.0	167.29
DINH DANG BUI	ON	4.0	92.57
ALAN BUI	ON	4.0	149.14

Recommendations

Reviewing the methods used, alternatives along the way could have been done. Utilizing regular expressions and creating flexibility on the column headers. This would allow us to create a DataFrame that collected all the data provided in long format.